\HTMLPurifier_Lexer

Forgivingly lexes HTML (SGML-style) markup into tokens.

A lexer parses a string of SGML-style markup and converts it into
corresponding tokens. It doesn't check for well-formedness, although an
implementation's internal mechanism may enforce it automatically (as is the
case with HTMLPurifier_Lexer_DOMLex). There are several implementations to
choose from.
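
As an illustration of forgiving lexing, here is a minimal sketch; it is not one of the library's actual implementations. The function name sketchTokenize() and the array-based token shape are assumptions made for this example only; the real lexers return HTMLPurifier_Token objects.

```php
<?php
// Hypothetical helper: split markup into start, end and text tokens
// without checking that tags are balanced or correctly nested.
function sketchTokenize(string $html): array
{
    $tokens = [];
    // Capture anything that looks like <...> and keep it as its own part.
    $parts = preg_split(
        '/(<[^>]*>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    foreach ($parts as $part) {
        if (preg_match('!^</\s*([a-zA-Z][a-zA-Z0-9]*)!', $part, $m)) {
            $tokens[] = ['end', strtolower($m[1])];
        } elseif (preg_match('!^<([a-zA-Z][a-zA-Z0-9]*)!', $part, $m)) {
            $tokens[] = ['start', strtolower($m[1])];
        } else {
            // Anything else (including comments) is kept as text in this sketch.
            $tokens[] = ['text', $part];
        }
    }
    return $tokens;
}

// The stray </i> is emitted as an end token rather than rejected.
echo json_encode(sketchTokenize('<b>Hi</i> there')), "\n";
```

The mismatched closing tag passes through untouched, which is the sense in which a lexer of this kind does not check well-formedness; repairing the token stream is left to later stages.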

A lexer is HTML-oriented: it might work with XML, but that is not
recommended, as we adhere to a subset of the specification for optimization
reasons. This might change in the future. Also, most tokenizers are not
expected to handle document type definitions (DTDs) or processing
instructions (PIs).

This class should not be directly instantiated, but you may use create() to
retrieve a default copy of the lexer. Being a supertype, this class
does not actually define any implementation, but offers commonly used
convenience functions for subclasses.
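
The arrangement described above (an abstract supertype with shared helpers, obtained through a static factory) can be sketched as follows. Every name here (SketchLexer, SketchDomLexer, SketchDirectLexer, describe(), normalizeNewlines()) is hypothetical, and the default-selection rule is an assumption made for the sketch, not the library's documented behavior.

```php
<?php
// Hypothetical supertype: it cannot be instantiated directly.
abstract class SketchLexer
{
    // Factory method: the caller asks for a lexer without naming a class.
    public static function create(?string $impl = null): SketchLexer
    {
        if ($impl === null) {
            // Assumed default policy, for this sketch only.
            $impl = class_exists('DOMDocument') ? 'dom' : 'direct';
        }
        switch ($impl) {
            case 'dom':
                return new SketchDomLexer();
            case 'direct':
                return new SketchDirectLexer();
            default:
                throw new InvalidArgumentException("Unknown lexer: $impl");
        }
    }

    // Convenience helper shared by every subclass.
    protected function normalizeNewlines(string $html): string
    {
        return str_replace(["\r\n", "\r"], "\n", $html);
    }

    abstract public function describe(string $html): string;
}

class SketchDomLexer extends SketchLexer
{
    public function describe(string $html): string
    {
        return 'dom:' . $this->normalizeNewlines($html);
    }
}

class SketchDirectLexer extends SketchLexer
{
    public function describe(string $html): string
    {
        return 'direct:' . $this->normalizeNewlines($html);
    }
}

$lexer = SketchLexer::create('direct');
echo get_class($lexer), "\n";
```

Keeping construction behind create() means calling code never depends on a particular subclass, so the default implementation can change without touching callers.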

Synopsis

class HTMLPurifier_Lexer {

// members
public boolean $tracksLineNumbers = false;
protected array $_special_entity2str;

// methods
public static HTMLPurifier_Lexer create()
public void __construct()
public void parseData()
public HTMLPurifier_Token[] tokenizeHTML()
protected static void escapeCDATA()
protected static void escapeCommentedCDATA()
protected static void removeIEConditional()
protected static void CDATACallback()
public void normalize()
public void extractBody()

}

Tasks

Line	Task
263+	Consider making protected
314+	Consider making protected

Members

protected

$_special_entity2str
Most common entity to raw value conversion table for special entities.
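
A conversion table of this kind can be pictured as a plain map from entity strings to raw characters. The sketch below is only illustrative: the table entries and the helper name decodeSpecialEntities() are assumptions, since the actual contents of $_special_entity2str are not listed here.

```php
<?php
// Hypothetical lookup table; the real entries of $_special_entity2str
// are not shown in this document.
$specialEntityToStr = [
    '&quot;' => '"',
    '&amp;'  => '&',
    '&lt;'   => '<',
    '&gt;'   => '>',
    '&#39;'  => "'",
];

// Hypothetical helper: replace each known special entity with its raw character.
function decodeSpecialEntities(string $text, array $table): string
{
    // Fast path: without an ampersand there is nothing to decode.
    if (strpos($text, '&') === false) {
        return $text;
    }
    // strtr() with an array scans the input once, so '&amp;lt;' becomes
    // '&lt;' and is not decoded a second time.
    return strtr($text, $table);
}

echo decodeSpecialEntities('1 &lt; 2 &amp;&amp; &quot;a&quot;', $specialEntityToStr), "\n";
// prints: 1 < 2 && "a"
```

Using a single-pass replacement matters here: chaining several str_replace() calls could decode the same text twice.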

public

$tracksLineNumbers
Whether or not this lexer implements line-number/column-number tracking.

Methods

protected

CDATACallback() — Callback function for escapeCDATA() that does the work.
escapeCDATA() — Translates CDATA sections into regular sections (through escaping).
escapeCommentedCDATA() — Special CDATA case that is especially convoluted for <script>.
removeIEConditional() — Special Internet Explorer conditional comments should be removed.
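
To make the CDATA and conditional-comment helpers more concrete, here is a minimal sketch of the general technique, assuming simple regular expressions are sufficient. The sketch* function names are hypothetical stand-ins, the patterns are not claimed to match the library's own, and the commented-CDATA case for <script> is not covered.

```php
<?php
// Hypothetical stand-in for CDATACallback(): escape the captured CDATA body.
function sketchCdataCallback(array $matches): string
{
    // $matches[1] is the text between "<![CDATA[" and "]]>".
    return htmlspecialchars($matches[1], ENT_COMPAT, 'UTF-8');
}

// Hypothetical stand-in for escapeCDATA(): turn CDATA sections into escaped text.
function sketchEscapeCdata(string $html): string
{
    // The /s flag lets "." span newlines; ".*?" stops at the first "]]>".
    return preg_replace_callback(
        '/<!\[CDATA\[(.*?)\]\]>/s',
        'sketchCdataCallback',
        $html
    );
}

// Hypothetical stand-in for removeIEConditional(): drop the whole conditional block.
function sketchRemoveIeConditional(string $html): string
{
    return preg_replace('/<!--\[if [^>]+\]>.*?<!\[endif\]-->/si', '', $html);
}

echo sketchEscapeCdata('<p><![CDATA[a < b & c]]></p>'), "\n";
// prints: <p>a &lt; b &amp; c</p>
echo sketchRemoveIeConditional('x<!--[if IE]><b>ie</b><![endif]-->y'), "\n";
// prints: xy
```

Splitting the work between a matching function and a callback mirrors the escapeCDATA() / CDATACallback() pairing listed above: the pattern finds the section, and the callback decides how its body is rewritten.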

public

__construct()
create() — Retrieves or sets the default Lexer as a Prototype Factory.
extractBody() — Takes a string of HTML (fragment or document) and returns the content of its body.
normalize() — Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting the relevant parts, and performing other cleanup.
parseData() — Parses special entities into the proper characters.
tokenizeHTML() — Lexes an HTML string into tokens.
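
The idea behind extractBody() can be sketched as a single pattern match. This is a sketch under stated assumptions rather than the library's code: the name sketchExtractBody() is hypothetical, and returning the input unchanged when no body element is found is an assumed fallback.

```php
<?php
// Hypothetical stand-in for extractBody(): return what lies between
// <body ...> and </body>, or the input unchanged when there is no body.
function sketchExtractBody(string $html): string
{
    // /i ignores case; /s lets "." match newlines inside the body.
    if (preg_match('!<body[^>]*>(.*)</body>!is', $html, $matches)) {
        return $matches[1];
    }
    // Assumed fallback: treat the input as a fragment and leave it alone.
    return $html;
}

echo sketchExtractBody('<html><body class="x"><p>Hi</p></body></html>'), "\n";
// prints: <p>Hi</p>
echo sketchExtractBody('<p>Just a fragment</p>'), "\n";
// prints: <p>Just a fragment</p>
```

Handling both inputs through one function lets the rest of the pipeline work on fragments only, whether the caller supplied a full document or not.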