\StaticSiteUrlList
Synopsis
class StaticSiteUrlList
{
- // members
- public static string $undefined_mime_type = 'unknown';
- protected $baseURL;
- protected $cacheDir;
- protected $urls = NULL;
- protected boolean $autoCrawl = false;
- protected StaticSiteUrlProcessor $urlProcessor = NULL;
- protected $extraCrawlURLs = NULL;
- protected array $excludePatterns = array();
- protected StaticSiteContentSource $source;
- // methods
- public void __construct()
- public void setUrlProcessor()
- public void setExtraCrawlURls()
- public array getExtraCrawlURLs()
- public void setExcludePatterns()
- public array getExcludePatterns()
- public void setAutoCrawl()
- public string getSpiderStatus()
- public mixed getRawCacheData()
- public null getNumURLs()
- public array getProcessedURLs()
- public boolean hasCrawled()
- public void loadUrls()
- public void reprocessUrls()
- public StaticSiteCrawler crawl()
- public void saveURLs()
- public void addAbsoluteURL()
- public void addURL()
- public void addInferredURL()
- public boolean hasURL()
- public string simplifyURL()
- public boolean hasProcessedURL()
- public string parentProcessedURL()
- public array processedURL()
- public array generateProcessedURL()
- public array getChildren()
- public mixed getProperty()
- public string getCacheFileContents()
Tasks
Line | Task |
---|---|
681+ | implement to replace x3 refs to unserialize(file_get_contents($this->cacheDir . 'urls')); |
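The task above asks for one helper to replace the repeated `unserialize(file_get_contents($this->cacheDir . 'urls'))` expression. A minimal sketch of such a helper follows. It is a standalone function, not the class's actual implementation: the name `read_url_cache`, the `null` return on failure, and the `allowed_classes` restriction are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: reads and unserializes the 'urls' cache file.
// Assumes $cacheDir ends with a directory separator, as the expression
// $this->cacheDir . 'urls' in the task suggests.
function read_url_cache(string $cacheDir): ?array
{
    $file = $cacheDir . 'urls';

    // A missing or unreadable cache is reported as null instead of a warning.
    if (!is_file($file) || !is_readable($file)) {
        return null;
    }

    $raw = file_get_contents($file);
    if ($raw === false) {
        return null;
    }

    // Assumption: the cache holds plain arrays, so object instantiation is disabled.
    $data = @unserialize($raw, ['allowed_classes' => false]);

    return is_array($data) ? $data : null;
}
```

Centralising the read in one place means the missing-file and corrupt-file cases are handled once instead of at each of the three call sites.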
Members
protected
- $autoCrawl — boolean
- $baseURL — string
- $cacheDir — string
- $excludePatterns — array
  A list of regular expression patterns to exclude from scraping
- $extraCrawlURLs — array
- $source — StaticSiteContentSource
  The StaticSiteContentSource object
- $urlProcessor — StaticSiteUrlProcessor
- $urls — array
  Two-element array with the keys 'inferred' and 'regular':
  - 'regular' is an array mapping raw URLs to processed URLs
  - 'inferred' is an array of inferred URLs
public
- $undefined_mime_type — string
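To make the shape of the `$urls` member concrete, here is a small sketch of the two-key array described above. The URLs are invented, and two details are assumptions made for this example only: that each processed URL is a plain string, and that 'inferred' is a flat list.

```php
<?php
// Illustrative data only: these URLs are invented, and storing each processed
// URL as a plain string is an assumption made for this sketch.
$urls = [
    'regular' => [
        'http://example.com/about-us.html'      => '/about-us',
        'http://example.com/about-us/team.html' => '/about-us/team',
    ],
    'inferred' => [
        '/about-us/history',
    ],
];

// Count the crawled (regular) URLs.
$numRegular = count($urls['regular']);

// Look up the processed form of a raw URL, or null when it is not in the list.
$raw = 'http://example.com/about-us.html';
$processed = $urls['regular'][$raw] ?? null;

// Check whether a processed URL is known, either crawled or inferred.
$known = in_array('/about-us/history', $urls['regular'], true)
    || in_array('/about-us/history', $urls['inferred'], true);

var_dump($numRegular, $processed, $known);
```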
Methods
public
- __construct() — Create a new URL List
- addAbsoluteURL() — Add a URL to this list, given the absolute URL.
- addInferredURL() — Add an inferred URL to the list.
- addURL() — Appends a processed URL onto the URL cache.
- crawl()
- generateProcessedURL() — Execute custom logic for processing URLs prior to hierarchy generation.
- getCacheFileContents() — Get the serialized cache content and return the unserialized string
- getChildren() — Return the URLs that are a child of the given URL
- getExcludePatterns() — Get an array of regular expression patterns that should not be added to the URL list.
- getExtraCrawlURLs() — Return the additional crawl URLs as an array
- getNumURLs() — Return the number of URLs crawled so far
- getProcessedURLs() — Return a map of URLs crawled, with raw URLs as keys and processed URLs as values
- getProperty() — Simple property getter. Used in unit-testing.
- getRawCacheData() — Raw URL+Mime data accessor method, used internally by logic outside of the class.
- getSpiderStatus() — Returns the status of the spidering: "Complete", "Partial", or "Not started".
- hasCrawled() — There are URLs and we're not in the middle of a crawl
- hasProcessedURL() — Returns true if the given URL is in the list of processed URLs
- hasURL() — Return true if the given URL exists.
- loadUrls() — Load the URLs, either by crawling, or by fetching from cache.
- parentProcessedURL() — Return the processed URL that is the parent of the given one.
- processedURL() — Find the processed URL in the URL list
- reprocessUrls() — Re-execute the URL processor on all the fetched URLs.
- saveURLs() — Cache the current list of URLs to disk.
- setAutoCrawl() — Set whether the crawl should be triggered on demand.
- setExcludePatterns() — Set an array of regular expression patterns that should be excluded from being added to the URL list.
- setExtraCrawlURls() — Define additional crawl URLs as an array. Each of these URLs will be crawled in addition to the base URL.
- setUrlProcessor() — Set a URL processor for this URL List.
- simplifyURL() — Simplify a URL. Ignores https/http differences and "www." / non differences.
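As an illustration of the simplifyURL() description above, the following standalone sketch normalises a URL so that https/http and "www." differences are ignored. It is not the class's actual code: the function name `simplify_url_sketch` and the choice to normalise towards `http://` without "www." are assumptions.

```php
<?php
// Hypothetical standalone sketch of the behaviour described for simplifyURL().
function simplify_url_sketch(string $url): string
{
    // Treat https and http as equivalent by rewriting the scheme to http.
    $url = preg_replace('#^https://#i', 'http://', $url);

    // Treat "www.example.com" and "example.com" as the same host.
    $url = preg_replace('#^http://www\.#i', 'http://', $url);

    return $url;
}

// Two spellings of the same page compare equal once simplified.
$a = simplify_url_sketch('https://www.example.com/about-us');
$b = simplify_url_sketch('http://example.com/about-us');
var_dump($a === $b);
```

Comparing simplified forms lets a lookup such as hasURL() match a URL regardless of which scheme or host spelling the crawler encountered.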