SilverStripe\TextExtraction\Extractor\HTMLTextExtractor

Text extractor that uses the PHP function strip_tags() to reduce an HTML document to plain text. The result is adequate for search indexing, but it is not well suited to display as readable text.

Author: mstephens

Synopsis

class HTMLTextExtractor extends FileTextExtractor {

// members
private static integer $priority = 10;

// Inherited members from FileTextExtractor
protected static $sorted_extractor_classes;

// methods
public boolean isAvailable()
public boolean supportsExtension()
public boolean supportsMime()
public string getContent()

// Inherited methods from FileTextExtractor
protected static array get_extractor_classes()
protected static FileTextExtractor get_extractor()
public static FileTextExtractor|null for_file()
protected static string getPathFromFile()
public abstract boolean isAvailable()
public abstract boolean supportsExtension()
public abstract boolean supportsMime()
public abstract string getContent()

}

Hierarchy

Extends

SilverStripe\TextExtraction\Extractor\FileTextExtractor

Members

private

$priority — integer
Lower priority because this is not the most sophisticated HTML extraction. If a better extractor is available, it should be used instead.

protected

$sorted_extractor_classes — array
Cache of extractor class names, sorted by priority
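
As an illustration of how a priority-sorted cache of class names could be built, here is a minimal sketch. The function name, the two example class names, and their priority values are hypothetical; only the value 10 for HTMLTextExtractor comes from this page. The sketch also assumes that a larger number means "preferred", which this page does not state explicitly.

```php
<?php
/**
 * Hypothetical helper: order extractor class names by priority.
 *
 * @param array $priorities Map of class name => integer priority
 * @return array Class names, highest priority first (assumed ordering)
 */
function sortExtractorsByPriority(array $priorities): array
{
    // arsort() sorts by value in descending order and keeps the keys.
    arsort($priorities);

    // Only the ordered class names need to be cached.
    return array_keys($priorities);
}

$sorted = sortExtractorsByPriority([
    'HTMLTextExtractor'    => 10, // value documented above
    'ExampleTikaExtractor' => 90, // hypothetical class and value
    'ExamplePdfExtractor'  => 50, // hypothetical class and value
]);
```

Under that assumption, HTMLTextExtractor ends up last in the list, so it is only reached when no more capable extractor applies.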

Methods

public

getContent() — Extracts text content using strip_tags(), combined with regular expressions that remove non-content tags such as <style> or <script> and add line breaks after block tags.
isAvailable()
supportsExtension()
supportsMime()
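
The approach described for getContent() can be sketched roughly as follows. This is not the module's actual implementation: the function name htmlToText() and the exact patterns are assumptions chosen for illustration, and the real method may handle more tags.

```php
<?php
/**
 * Hypothetical sketch: reduce an HTML string to plain text.
 */
function htmlToText(string $html): string
{
    // Remove elements whose contents are never visible text.
    // \b stops <head from also matching <header>.
    $html = preg_replace(
        [
            '@<head\b[^>]*>.*?</head>@si',
            '@<style\b[^>]*>.*?</style>@si',
            '@<script\b[^>]*>.*?</script>@si',
        ],
        '',
        $html
    );

    // Add a newline after block-level closing tags and <br>, so that
    // words from adjacent blocks do not run together once tags are gone.
    $html = preg_replace(
        '@</(?:p|div|h[1-6]|li|tr|blockquote)\s*>|<br\s*/?>@i',
        '$0' . "\n",
        $html
    );

    // Drop all remaining markup.
    return trim(strip_tags($html));
}

$text = htmlToText(
    '<html><head><title>T</title><style>p{color:red}</style></head>'
    . '<body><p>Hello</p><script>alert(1)</script><p>World</p></body></html>'
);
```

Regular expressions are not a robust way to parse HTML in general, but for indexing purposes well-formedness matters little, which matches the "OK for indexing" caveat in the class description.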

Inherited from SilverStripe\TextExtraction\Extractor\FileTextExtractor

protected

getPathFromFile() — Some text extractors (like pdftotext) may require a physical file to read from, so write the current file contents to a temp file and return its path
get_extractor() — Get the text file extractor for the given class
get_extractor_classes() — Gets the list of prioritised extractor classes
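
The temp-file step described for getPathFromFile() might look roughly like the sketch below. The function name writeToTempFile() and the 'extract_' prefix are hypothetical, and the real method takes a File object rather than a raw string.

```php
<?php
/**
 * Hypothetical sketch: write contents to a temp file and return its path.
 */
function writeToTempFile(string $contents): string
{
    // tempnam() creates a uniquely named empty file and returns its path.
    $path = tempnam(sys_get_temp_dir(), 'extract_');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file');
    }

    if (file_put_contents($path, $contents) === false) {
        throw new RuntimeException('Could not write to ' . $path);
    }

    return $path;
}

$tempPath = writeToTempFile('example contents');
$readBack = file_get_contents($tempPath);
unlink($tempPath); // the caller is responsible for cleaning up
```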

public

for_file() — Given a File object, decide which extractor instance to use to handle it
getContent() — Given a File instance, extract the contents as text.
isAvailable() — Checks if the extractor is supported on the current environment, for example if the correct binaries or libraries are available.
supportsExtension() — Determine if this extractor supports the given extension.
supportsMime() — Determine if this extractor supports the given mime type.
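
To show how the three checks above could work together when choosing an extractor for a file, here is a minimal, self-contained sketch. The interface, the ExampleHtmlExtractor class, the pickExtractor() function, and the extension and MIME values are all hypothetical stand-ins; the actual logic in for_file() may differ.

```php
<?php
// Hypothetical stand-in for the abstract methods listed above.
interface ExampleExtractor
{
    public function isAvailable(): bool;
    public function supportsExtension(string $extension): bool;
    public function supportsMime(string $mime): bool;
}

// Hypothetical HTML extractor; the accepted values are assumptions.
final class ExampleHtmlExtractor implements ExampleExtractor
{
    public function isAvailable(): bool
    {
        // strip_tags() is part of PHP core, so nothing external is needed.
        return true;
    }

    public function supportsExtension(string $extension): bool
    {
        return in_array(strtolower($extension), ['html', 'htm', 'xhtml'], true);
    }

    public function supportsMime(string $mime): bool
    {
        return strtolower($mime) === 'text/html';
    }
}

/**
 * Return the first usable extractor from a priority-ordered list, or null.
 */
function pickExtractor(array $extractors, string $extension, string $mime): ?ExampleExtractor
{
    foreach ($extractors as $extractor) {
        if (!$extractor->isAvailable()) {
            continue; // e.g. a required binary or library is missing
        }
        if ($extractor->supportsMime($mime) || $extractor->supportsExtension($extension)) {
            return $extractor;
        }
    }
    return null;
}

$candidates = [new ExampleHtmlExtractor()];
$forHtml = pickExtractor($candidates, 'HTM', 'application/octet-stream');
$forPdf  = pickExtractor($candidates, 'pdf', 'application/pdf');
```

Here $forHtml is the HTML extractor, matched by extension even though the MIME type is generic, and $forPdf is null because nothing in the list supports PDF.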