\StaticSiteUrlList
Represents a set of URLs parsed from a site. Uses PHPCrawl to prepare a list of the URLs on the site.
Synopsis
class StaticSiteUrlList
{
- // members
- protected $baseURL;
- protected $urls = NULL;
- protected boolean $autoCrawl = false;
- protected $urlProcessor = NULL;
- protected $extraCrawlURLs = NULL;
- protected array $excludePatterns = array();
- // methods
- public void __construct()
- public void setUrlProcessor()
- public void setExtraCrawlURls()
- public void getExtraCrawlURLs()
- public void setExcludePatterns()
- public array getExcludePatterns()
- public void setAutoCrawl()
- public [type] getSpiderStatus()
- public void getNumURLs()
- public array getRawURLs()
- public array getProcessedURLs()
- public void hasCrawled()
- public void loadUrls()
- public void reprocessUrls()
- public StaticSiteCrawler crawl()
- public [type] saveURLs()
- public void addAbsoluteURL()
- public void addURL()
- public void addInferredURL()
- public boolean hasURL()
- protected string simplifyURL()
- public boolean hasProcessedURL()
- public string parentProcessedURL()
- public string unprocessedURL()
- public [type] processedURL()
- public string generateProcessedURL()
- public [type] getChildren()
Members
protected
- $autoCrawl
- $baseURL
- $excludePatterns — array — A list of regular expression patterns to exclude from scraping
- $extraCrawlURLs
- $urlProcessor
- $urls — Two-element array: contains keys 'regular' and 'inferred'; 'regular' is an array mapping raw URLs to processed URLs, 'inferred' is an array of inferred URLs
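To illustrate the shape of the `$urls` member described above, a hypothetical value might look like the following (the URLs themselves are invented for illustration; only the two-key structure is documented):

```php
<?php
// Illustrative sketch of the $urls structure, not real crawl data.
$urls = array(
    // 'regular': raw crawled URLs mapped to their processed form
    'regular' => array(
        '/about-us/index.html' => '/about-us',
        '/about-us/team.html'  => '/about-us/team',
    ),
    // 'inferred': URLs deduced from the hierarchy rather than crawled
    // directly, e.g. a parent path implied by a crawled child URL
    'inferred' => array(
        '/about-us',
    ),
);
```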
Methods
protected
- simplifyURL() — Simplify a URL.
public
- __construct() — Create a new URL List
- addAbsoluteURL() — Add a URL to this list, given the absolute URL
- addInferredURL() — Add an inferred URL to the list.
- addURL()
- crawl()
- generateProcessedURL() — Execute custom logic for processing URLs prior to hierarchy generation.
- getChildren() — Return the URLs that are a child of the given URL
- getExcludePatterns() — Get an array of regular expression patterns that should not be added to the url list
- getExtraCrawlURLs() — Return the additional crawl URLs as an array
- getNumURLs() — Return the number of URLs crawled so far
- getProcessedURLs() — Return a map of URLs crawled, with raw URLs as keys and processed URLs as values
- getRawURLs() — Return the raw URLs as an array
- getSpiderStatus() — Returns the status of the spidering: "Complete", "Partial", or "Not started"
- hasCrawled()
- hasProcessedURL() — Returns true if the given URL is in the list of processed URLs
- hasURL() — Return true if the given URL exists
- loadUrls() — Load the URLs, either by crawling, or by fetching from cache
- parentProcessedURL() — Return the processed URL that is the parent of the given one.
- processedURL() — Find the processed URL in the URL list
- reprocessUrls() — Re-execute the URL processor on all the fetched URLs
- saveURLs() — Save the current list of URLs to disk
- setAutoCrawl() — Set whether the crawl should be triggered on demand.
- setExcludePatterns() — Set an array of regular expression patterns that should be excluded from being added to the url list
- setExtraCrawlURls() — Define additional crawl URLs as an array. Each of these URLs will be crawled in addition to the base URL.
- setUrlProcessor() — Set a URL processor for this URL List.
- unprocessedURL() — Return the regular URL, given the processed one.
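A minimal usage sketch tying these methods together. The constructor's parameters are not shown in the synopsis above, so the argument passed here is an assumption; consult the class source for the exact signature:

```php
<?php
// Sketch only: the constructor argument is an assumption, not documented
// in the synopsis above.
$urlList = new StaticSiteUrlList('http://example.org/');

// Exclude URLs matching these regular expression patterns from the list
$urlList->setExcludePatterns(array('admin/.*', '.*\?search=.*'));

// Trigger the crawl on demand, when URLs are first requested
$urlList->setAutoCrawl(true);

// Load the URLs, either by crawling or by fetching from cache
$urlList->loadUrls();

// Inspect progress: "Complete", "Partial", or "Not started"
echo $urlList->getSpiderStatus();
echo $urlList->getNumURLs();
```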