\StaticSiteContentExtractor
This tool uses a combination of cURL and phpQuery to extract content from a URL.
The page at the URL is first downloaded using cURL, and then passed into phpQuery for processing.
Given a map of field names to CSS selectors, a map of field names to the extracted
content is returned.
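The field-to-selector mapping described above can be sketched without the class itself. This is only an illustration under stated assumptions, not the module's implementation: it uses PHP's built-in DOMDocument and DOMXPath as a stand-in for phpQuery, takes the HTML as a string instead of fetching it with cURL, and understands only trivial selectors (tag, #id, .class). The function names cssToXPath and extractMap, and the sample field names, are hypothetical.

```php
<?php
// Hypothetical helper: convert a trivial CSS selector (tag, #id, .class) to XPath.
// phpQuery supports far richer selectors; this is only a stand-in.
function cssToXPath(string $selector): string
{
    if ($selector[0] === '#') {
        return sprintf('//*[@id="%s"]', substr($selector, 1));
    }
    if ($selector[0] === '.') {
        return sprintf(
            '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
            substr($selector, 1)
        );
    }
    return '//' . $selector;
}

// Hypothetical helper: given HTML and a field => selector map,
// return a field => text content map (null when nothing matches).
function extractMap(string $html, array $selectorMap): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world markup is rarely clean.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $result = [];
    foreach ($selectorMap as $field => $selector) {
        // Only the first match is used for each field.
        $node = $xpath->query(cssToXPath($selector))->item(0);
        $result[$field] = $node ? trim($node->textContent) : null;
    }
    return $result;
}

$html = '<html><body><h1>Welcome</h1><div id="main"> Hello world </div></body></html>';
$fields = extractMap($html, ['Title' => 'h1', 'Content' => '#main']);
```

Here $fields maps each field name to the text of the first matching element. The real class presumably returns richer content (for example HTML rather than plain text); the sketch shows only the shape of the mapping.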
Synopsis
class StaticSiteContentExtractor
extends Object
{
    // members
    protected $url = NULL;
    protected $content = NULL;
    protected phpQueryObject $phpQuery = NULL;
    private static $log_file = NULL;
    // methods
    public void __construct()
    public array extractMapAndSelectors()
    public string extractField()
    protected string excludeContent()
    protected string getOuterHTML()
    public string getContent()
    protected void fetchContent()
    protected SS_HTTPResponse curlRequest()
    protected void log()
}
Hierarchy
Extends
- Object
Members
private
- $log_file — string. Set this by using the yml config system.
protected
- $content — string
- $phpQuery — phpQueryObject
- $url — string
Methods
protected
- curlRequest() — Use cURL to request a URL, and return a SS_HTTPResponse object.
- excludeContent() — Strip away content from $content that matches one or more CSS selectors.
- fetchContent() — Fetch the content and initialise $this->content and $this->phpQuery.
- getOuterHTML() — Get the full HTML of the element and its children.
- log() — Log a message if logging has been set up according to the docs.
public
- __construct() — Create a StaticSiteContentExtractor for a single URL.
- extractField() — Extract content for a single CSS selector.
- extractMapAndSelectors() — Extract content for a map of field => CSS-selector pairs.
- getContent()
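The behaviour described for excludeContent() and getOuterHTML() can be approximated with PHP's built-in DOM extension. The following is a rough sketch under stated assumptions, not the class's own code: it works on DOM nodes rather than a phpQueryObject, it takes XPath expressions instead of CSS selectors, and the function names outerHtml and excludeNodes are hypothetical.

```php
<?php
// Hypothetical stand-in for getOuterHTML(): serialise a node including its own tag.
function outerHtml(DOMNode $node): string
{
    return $node->ownerDocument->saveHTML($node);
}

// Hypothetical stand-in for excludeContent(): remove every node inside $parent
// matched by the given XPath expressions, then return the parent's outer HTML.
function excludeNodes(DOMElement $parent, array $xpathExpressions): string
{
    $xpath = new DOMXPath($parent->ownerDocument);
    foreach ($xpathExpressions as $expression) {
        // Copy the matches to an array so removal cannot disturb iteration.
        $matches = iterator_to_array($xpath->query($expression, $parent));
        foreach ($matches as $match) {
            $match->parentNode->removeChild($match);
        }
    }
    return outerHtml($parent);
}

$doc = new DOMDocument();
$doc->loadHTML('<div id="main"><p>Keep me</p><p class="ad">Drop me</p></div>');
$main = (new DOMXPath($doc))->query('//*[@id="main"]')->item(0);
$clean = excludeNodes($main, ['.//p[@class="ad"]']);
```

After the call, $clean holds the markup of the div with the excluded paragraph removed, while the wrapping element itself is preserved.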