SilverStripe\TextExtraction\Extractor\HTMLTextExtractor
Text extractor that uses the PHP function strip_tags() to get just the text. Adequate for indexing, but not ideal for producing readable text.
- Author: mstephens
Synopsis
class HTMLTextExtractor
extends FileTextExtractor
{
- // members
- private static integer $priority = 10;
- // Inherited members from FileTextExtractor
- protected static $sorted_extractor_classes;
- // methods
- public boolean isAvailable()
- public array supportsExtension()
- public string supportsMime()
- public string getContent()
- // Inherited methods from FileTextExtractor
- protected static array get_extractor_classes()
- protected static FileTextExtractor get_extractor()
- public static FileTextExtractor|null for_file()
- protected static string getPathFromFile()
- public abstract boolean isAvailable()
- public abstract boolean supportsExtension()
- public abstract boolean supportsMime()
- public abstract string getContent()
Hierarchy
Members
private
- $priority
—
integer
Lower priority because this is not the most clever HTML extraction. If a better extractor is available, it should be used instead.
protected
- $sorted_extractor_classes
—
array
Cache of extractor class names, sorted by priority
Methods
public
- getContent() — Extracts text content by combining strip_tags() with regular expressions that remove non-content tags such as <style> or <script> and add line breaks after block tags.
- isAvailable()
- supportsExtension()
- supportsMime()
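The approach described for getContent() can be illustrated with a minimal sketch. This is not the module's actual implementation: the function name extractTextFromHtml(), the list of block tags, and the entity decoding step are assumptions made for illustration only.

```php
<?php
// Hypothetical helper, not the module's code: illustrates combining
// strip_tags() with regular expressions as the description outlines.
function extractTextFromHtml(string $html): string
{
    // Drop <script> and <style> elements together with their contents;
    // strip_tags() alone would leave the inner code behind as text.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);

    // Add a newline after common block-level closing tags and <br>
    // so adjacent blocks do not run together once tags are removed.
    // The tag list here is an assumption, not taken from the module.
    $text = preg_replace(
        '#(</(p|div|h[1-6]|li|tr|blockquote)>|<br\s*/?>)#i',
        '$1' . "\n",
        $text
    );

    // Remove the remaining markup, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');

    return trim($text);
}
```

Removing the non-content elements before calling strip_tags() matters because strip_tags() discards only the tags themselves, not the script or stylesheet text between them.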
Inherited from SilverStripe\TextExtraction\Extractor\FileTextExtractor
protected
- getPathFromFile() — Some text extractors (like pdftotext) may require a physical file to read from, so this method writes the current file contents to a temp file and returns its path.
- get_extractor() — Get the text file extractor for the given class
- get_extractor_classes() — Gets the list of prioritised extractor classes
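The temp-file step described for getPathFromFile() might look roughly like the following sketch. The function writeToTempFile() and its parameters are hypothetical; the real method works from a File object, which is replaced here by a plain string of contents and an extension.

```php
<?php
// Hypothetical helper: $contents and $extension stand in for data that
// would normally be read from a File object.
function writeToTempFile(string $contents, string $extension): string
{
    // tempnam() creates a uniquely named empty file and returns its path.
    $base = tempnam(sys_get_temp_dir(), 'extract_');
    if ($base === false) {
        throw new RuntimeException('Could not create a temporary file.');
    }

    // Rename so the path carries the original extension, which some
    // command-line tools rely on to detect the file type.
    $path = $base . '.' . $extension;
    if (!rename($base, $path)) {
        throw new RuntimeException('Could not rename the temporary file.');
    }

    if (file_put_contents($path, $contents) === false) {
        throw new RuntimeException('Could not write the temporary file.');
    }

    return $path;
}
```

The caller is responsible for deleting the returned path (for example with unlink()) once extraction has finished.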
public
- for_file() — Given a File object, decide which extractor instance to use to handle it
- getContent() — Given a File instance, extract the contents as text.
- isAvailable() — Checks if the extractor is supported on the current environment, for example if the correct binaries or libraries are available.
- supportsExtension() — Determine if this extractor supports the given extension.
- supportsMime() — Determine if this extractor supports the given mime type.
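How these methods could fit together when choosing an extractor for a file is sketched below. All class and function names (SketchExtractor, SketchHtmlExtractor, SketchUnavailableExtractor, pickExtractor) are hypothetical stand-ins rather than the module's API, and the ordering rule (higher priority numbers are tried first) is an assumption inferred from the note on $priority.

```php
<?php
// Illustrative stand-ins only; these are not the module's classes.
abstract class SketchExtractor
{
    public static int $priority = 50;

    abstract public function isAvailable(): bool;
    abstract public function supportsExtension(string $extension): bool;
    abstract public function supportsMime(string $mime): bool;
}

class SketchHtmlExtractor extends SketchExtractor
{
    public static int $priority = 10; // deliberately low, as documented above

    public function isAvailable(): bool
    {
        return true; // strip_tags() is always present in PHP
    }

    public function supportsExtension(string $extension): bool
    {
        return in_array(strtolower($extension), ['html', 'htm', 'xhtml'], true);
    }

    public function supportsMime(string $mime): bool
    {
        return strtolower($mime) === 'text/html';
    }
}

class SketchUnavailableExtractor extends SketchExtractor
{
    public static int $priority = 90;

    public function isAvailable(): bool
    {
        return false; // e.g. a required binary is missing
    }

    public function supportsExtension(string $extension): bool
    {
        return strtolower($extension) === 'html';
    }

    public function supportsMime(string $mime): bool
    {
        return strtolower($mime) === 'text/html';
    }
}

function pickExtractor(array $classes, string $extension, string $mime): ?SketchExtractor
{
    // Assumption: higher priority numbers are tried first.
    usort($classes, fn (string $a, string $b): int => $b::$priority <=> $a::$priority);

    foreach ($classes as $class) {
        $extractor = new $class();
        if (!$extractor->isAvailable()) {
            continue; // skip extractors that cannot run in this environment
        }
        if ($extractor->supportsExtension($extension) || $extractor->supportsMime($mime)) {
            return $extractor;
        }
    }

    return null; // nothing can handle this file
}
```

In this sketch the higher-priority extractor is skipped because it reports itself unavailable, so the low-priority HTML extractor acts as the fallback.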