\StaticSiteContentExtractor
This tool uses a combination of cURL and phpQuery to extract content from a URL.
The URL is first downloaded using cURL, and the response is then passed to phpQuery for processing.
Given a set of field names and their corresponding CSS selectors, a map of content
fields is returned.
If the URL represents a file-based MIME type, a File object is created, and the
physical file it represents can then be post-processed and saved to the SilverStripe database and filesystem.
- Author: Sam Minee <sam@silverstripe.com>
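The download-then-extract flow described above can be sketched in standalone PHP. This is a minimal illustration under stated assumptions, not the class's actual implementation: PHP's built-in DOMDocument and DOMXPath stand in for phpQuery, so the map uses XPath expressions rather than CSS selectors, and the function names `fetchUrl()` and `extractFields()` are hypothetical.

```php
<?php
// Sketch only: DOMXPath stands in for phpQuery, so the map uses XPath
// expressions instead of CSS selectors. All function names are hypothetical.

/**
 * Download a URL with cURL and return [body, mime]; body is null on failure.
 */
function fetchUrl(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    $mime = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return [$body === false ? null : $body, $mime ?: null];
}

/**
 * Given HTML and a map of field name => XPath expression, return a map of
 * field name => text content of the first matching node ('' if none match).
 */
function extractFields(string $html, array $fieldMap): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($fieldMap as $field => $expression) {
        $nodes = $xpath->query($expression);
        $result[$field] = ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : '';
    }
    return $result;
}

// Usage with a literal HTML string in place of a downloaded page.
$html = '<html><body><h1>Hello</h1><div id="main"><p>Body text</p></div></body></html>';
$fields = extractFields($html, [
    'Title'   => '//h1',
    'Content' => '//div[@id="main"]',
]);
```

In a real run, the body returned by `fetchUrl()` would be passed to `extractFields()` only when the reported MIME type is HTML; file and image responses would follow the File-object path instead.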
Synopsis
class StaticSiteContentExtractor
extends Object
{
- // members
- protected $url = NULL;
- protected $mime = NULL;
- protected $content = NULL;
- protected phpQueryObject $phpQuery = NULL;
- protected string $tmpFileName = '';
- private static $log_file = NULL;
- protected StaticSiteMimeProcessor $mimeProcessor;
- protected Object $utils;
- // methods
- public void __construct()
- public array extractMapAndSelectors()
- public string extractField()
- protected string excludeContent()
- protected string getOuterHTML()
- public string getContent()
- protected void fetchContent()
- protected boolean curlRequest()
- public void setTmpFileName()
- public string getTmpFileName()
- public boolean isMimeHTML()
- public boolean isMimeFile()
- public boolean isMimeImage()
- public boolean isMimeFileOrImage()
- public void prepareContent()
}
Hierarchy
Extends
- Object
Tasks
Line | Task |
---|---|
174 | temporary workaround for File objects |
256+ | Handle defaults when $this->mime isn't matched. |
286+ | Add checks when fetching multi-megabyte images to ignore anything over 2 MB? |
Members
private
- $log_file — string. Set this using the YAML config system.
protected
- $content — string
- $mime — string
- $mimeProcessor — StaticSiteMimeProcessor. "Caches" the MIME processor for use throughout.
- $phpQuery — phpQueryObject
- $tmpFileName — string
- $url — string
- $utils — Object. Holds the StaticSiteUtils object on construct.
Methods
protected
- curlRequest() — Use cURL to request a URL, then either return an SS_HTTPResponse object (for `SiteTree` content) or write the cURL output directly to a temporary file ready for uploading to SilverStripe via Upload#load() (for `File` and `Image` content).
- excludeContent() — Strip away content from $content that matches one or more CSS selectors.
- fetchContent() — Fetch the content and initialise $this->content and $this->phpQuery.
- getOuterHTML() — Get the full HTML of the element and its children
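The behaviour described for excludeContent() and getOuterHTML() can be illustrated with a standalone sketch. It assumes PHP's built-in DOM extension in place of phpQuery and XPath expressions in place of CSS selectors; `outerHtml()` and `excludeMatches()` are hypothetical names, and the real methods may differ in signature and detail.

```php
<?php
// Sketch only: DOMDocument/DOMXPath stand in for phpQuery, and XPath
// expressions stand in for CSS selectors. Function names are hypothetical.

/** Return the full HTML of a node, including the node itself and its children. */
function outerHtml(DOMNode $node): string
{
    return $node->ownerDocument->saveHTML($node);
}

/**
 * Remove every node matching any of the given XPath expressions from the
 * first element matching $rootExpression, then return what is left of it.
 */
function excludeMatches(string $html, string $rootExpression, array $excludeExpressions): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $roots = $xpath->query($rootExpression);
    if ($roots === false || $roots->length === 0) {
        return '';
    }
    $root = $roots->item(0);

    foreach ($excludeExpressions as $expression) {
        $found = $xpath->query($expression, $root);
        if ($found === false) {
            continue; // invalid expression: skip it
        }
        // Copy to an array first: removing nodes while iterating a live list is unsafe.
        foreach (iterator_to_array($found) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
    return outerHtml($root);
}

$html = '<div id="main"><p>Keep me</p><div class="ads">Remove me</div></div>';
$clean = excludeMatches($html, '//div[@id="main"]', ['.//div[@class="ads"]']);
```

The exclusion expressions are evaluated relative to the root element, so only descendants of the extracted region are removed.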
public
- __construct() — Create a StaticSiteContentExtractor for a single URL.
- extractField() — Extract content for a single CSS selector
- extractMapAndSelectors() — Extract content for a map of field => CSS-selector pairs
- getContent()
- getTmpFileName()
- isMimeFile()
- isMimeFileOrImage()
- isMimeHTML()
- isMimeImage()
- prepareContent() — Pre-process the content so phpQuery can parse it without violently barfing
- setTmpFileName()
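The isMime*() checks have no descriptions here, so the following is only a guess at the kind of test they perform: classifying a Content-Type value as HTML, file, or image. The MIME lists and function names are illustrative assumptions, not the lists held by StaticSiteMimeProcessor.

```php
<?php
// Sketch only: the MIME lists below are illustrative assumptions, not the
// lists used by StaticSiteMimeProcessor. Function names are hypothetical.

const HTML_MIMES  = ['text/html', 'application/xhtml+xml'];
const IMAGE_MIMES = ['image/png', 'image/jpeg', 'image/gif'];
const FILE_MIMES  = ['application/pdf', 'application/zip', 'text/plain'];

/** Strip parameters such as "; charset=utf-8" and normalise case. */
function normaliseMime(?string $mime): string
{
    if ($mime === null) {
        return '';
    }
    $parts = explode(';', $mime);
    return strtolower(trim($parts[0]));
}

function isHtmlMime(?string $mime): bool
{
    return in_array(normaliseMime($mime), HTML_MIMES, true);
}

function isImageMime(?string $mime): bool
{
    return in_array(normaliseMime($mime), IMAGE_MIMES, true);
}

function isFileMime(?string $mime): bool
{
    return in_array(normaliseMime($mime), FILE_MIMES, true);
}

function isFileOrImageMime(?string $mime): bool
{
    return isFileMime($mime) || isImageMime($mime);
}
```

Normalising first means a header value such as `text/html; charset=UTF-8` is still recognised as HTML, and an unset MIME type matches none of the categories.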