\StaticSiteUrlList
Represents a set of URLs parsed from a site. Uses PHPCrawl to prepare a list of the URLs on the site.
Synopsis
class StaticSiteUrlList
{
- // members
- protected $baseURL;
- protected $urls = NULL;
- protected boolean $autoCrawl = false;
- protected $urlProcessor = NULL;
- protected $extraCrawlURLs = NULL;
- protected array $excludePatterns = array();
- // methods
- public void __construct()
- public void setUrlProcessor()
- public void setExtraCrawlURls()
- public void getExtraCrawlURLs()
- public void setExcludePatterns()
- public array getExcludePatterns()
- public void setAutoCrawl()
- public [type] getSpiderStatus()
- public void getNumURLs()
- public array getRawURLs()
- public array getProcessedURLs()
- public void hasCrawled()
- public void loadUrls()
- public void reprocessUrls()
- public StaticSiteCrawler crawl()
- public [type] saveURLs()
- public void addAbsoluteURL()
- public void addURL()
- public void addInferredURL()
- public boolean hasURL()
- protected string simplifyURL()
- public boolean hasProcessedURL()
- public string parentProcessedURL()
- public string unprocessedURL()
- public [type] processedURL()
- public string generateProcessedURL()
- public [type] getChildren()
Members
protected
- $autoCrawl
- $baseURL
- $excludePatterns — array — A list of regular expression patterns to exclude from scraping
- $extraCrawlURLs
- $urlProcessor
- $urls — Two-element array: contains keys 'regular' and 'inferred'; 'regular' is an array mapping raw URLs to processed URLs, 'inferred' is an array of inferred URLs
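To illustrate the shape of the `$urls` member described above, a hypothetical value might look like the following (the URLs themselves are invented for illustration; only the two-key structure is documented):

```php
<?php
// Illustrative sketch of the $urls structure, not real crawl data.
$urls = array(
    // 'regular': raw crawled URLs mapped to their processed form
    'regular' => array(
        '/about-us/index.html' => '/about-us',
        '/about-us/team.html'  => '/about-us/team',
    ),
    // 'inferred': URLs deduced from the hierarchy rather than crawled
    // directly, e.g. a parent path implied by a crawled child URL
    'inferred' => array(
        '/about-us',
    ),
);
```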
Methods
protected
- simplifyURL() — Simplify a URL.
public
- __construct() — Create a new URL List
- addAbsoluteURL() — Add a URL to this list, given the absolute URL
- addInferredURL() — Add an inferred URL to the list.
- addURL()
- crawl()
- generateProcessedURL() — Execute custom logic for processing URLs prior to hierarchy generation.
- getChildren() — Return the URLs that are a child of the given URL
- getExcludePatterns() — Get an array of regular expression patterns that should not be added to the url list
- getExtraCrawlURLs() — Return the additional crawl URLs as an array
- getNumURLs() — Return the number of URLs crawled so far
- getProcessedURLs() — Return a map of URLs crawled, with raw URLs as keys and processed URLs as values
- getRawURLs() — Return the raw URLs as an array
- getSpiderStatus() — Returns the status of the spidering: "Complete", "Partial", or "Not started"
- hasCrawled()
- hasProcessedURL() — Returns true if the given URL is in the list of processed URLs
- hasURL() — Return true if the given URL exists
- loadUrls() — Load the URLs, either by crawling, or by fetching from cache
- parentProcessedURL() — Return the processed URL that is the parent of the given one.
- processedURL() — Find the processed URL in the URL list
- reprocessUrls() — Re-execute the URL processor on all the fetched URLs
- saveURLs() — Save the current list of URLs to disk
- setAutoCrawl() — Set whether the crawl should be triggered on demand.
- setExcludePatterns() — Set an array of regular expression patterns that should be excluded from being added to the url list
- setExtraCrawlURls() — Define additional crawl URLs as an array. Each of these URLs will be crawled in addition to the base URL.
- setUrlProcessor() — Set a URL processor for this URL List.
- unprocessedURL() — Return the regular URL, given the processed one.
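A minimal usage sketch tying these methods together. The constructor's parameters are not shown in the synopsis above, so the argument passed here is an assumption; consult the class source for the exact signature:

```php
<?php
// Sketch only: the constructor argument is an assumption, not documented
// in the synopsis above.
$urlList = new StaticSiteUrlList('http://example.org/');

// Exclude URLs matching these regular expression patterns from the list
$urlList->setExcludePatterns(array('admin/.*', '.*\?search=.*'));

// Trigger the crawl on demand, when URLs are first requested
$urlList->setAutoCrawl(true);

// Load the URLs, either by crawling or by fetching from cache
$urlList->loadUrls();

// Inspect progress: "Complete", "Partial", or "Not started"
echo $urlList->getSpiderStatus();
echo $urlList->getNumURLs();
```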