\HTMLPurifier_Lexer_DOMLex

Parser that uses PHP 5's DOM extension (part of the core).

In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
It gives us a forgiving HTML parser, which we use to transform the HTML
into a DOM, and then into tokens. It is blazingly fast (for large
documents, it performs twenty times faster than
HTMLPurifier_Lexer_DirectLex) and is the default choice for PHP 5.
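
The HTML-to-DOM-to-tokens flow described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: the function names sketchTokenize and sketchTokenizeNode are hypothetical, tokens are plain arrays rather than HTMLPurifier_Token objects, attributes are left out, and libxml_use_internal_errors() is swapped in to silence parser warnings.

```php
<?php
// Illustrative sketch only; not HTML Purifier's actual implementation.
// Tokens are plain arrays here, and attributes are omitted for brevity.

// Recursively walk a DOM node, appending tokens to the accumulator.
function sketchTokenizeNode(DOMNode $node, array &$tokens): void
{
    if ($node instanceof DOMText) {
        $tokens[] = ['text', $node->data];
    } elseif ($node instanceof DOMComment) {
        $tokens[] = ['comment', $node->data];
    } elseif ($node instanceof DOMElement) {
        if (!$node->hasChildNodes()) {
            $tokens[] = ['empty', $node->tagName];
            return;
        }
        $tokens[] = ['start', $node->tagName];
        foreach ($node->childNodes as $child) {
            sketchTokenizeNode($child, $tokens);
        }
        $tokens[] = ['end', $node->tagName];
    }
    // Other node types (doctype, processing instructions) are ignored.
}

// Parse an HTML fragment with the forgiving DOM parser, then flatten it.
function sketchTokenize(string $html): array
{
    $doc = new DOMDocument();
    // Assumption: collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its content can be found again under a known <div>.
    $doc->loadHTML('<html><body><div>' . $html . '</div></body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tokens = [];
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    foreach ($wrapper->childNodes as $child) {
        sketchTokenizeNode($child, $tokens);
    }
    return $tokens;
}
```

Calling sketchTokenize('<b>Hi</b> there') yields a start token for b, a text token, an end token for b, and a trailing text token, in document order.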

Synopsis

class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer {

// members
private $factory;

// Inherited members from HTMLPurifier_Lexer
public boolean $tracksLineNumbers;
protected array $_special_entity2str;

// methods
public void __construct()
public void tokenizeHTML()
protected void tokenizeDOM()
protected void transformAttrToAssoc()
public void muteErrorHandler()
public void callbackUndoCommentSubst()
public void callbackArmorCommentEntities()
protected void wrapHTML()

// Inherited methods from HTMLPurifier_Lexer
public static Concrete create()
public void __construct()
public void parseData()
public HTMLPurifier_Token tokenizeHTML()
protected static void escapeCDATA()
protected static void escapeCommentedCDATA()
protected static void CDATACallback()
public void normalize()
public void extractBody()

}

Hierarchy

Extends

HTMLPurifier_Lexer

Tasks

Line	Task
252+	Consider making protected
286+	Consider making protected

Members

private

$factory

protected

$_special_entity2str
Most common entity to raw value conversion table for special entities.

public

$tracksLineNumbers
Whether or not this lexer implements line-number/column-number tracking.

Methods

protected

tokenizeDOM() — Recursive function that tokenizes a node, putting it into an accumulator.
transformAttrToAssoc() — Converts a DOMNamedNodeMap of DOMAttr objects into an assoc array.
wrapHTML() — Wraps an HTML fragment in the necessary HTML
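
As a rough illustration of the attribute conversion and fragment wrapping described above, the following sketch may help. The helper names sketchAttrToAssoc and sketchWrapHtml, the $charset parameter, and the exact wrapper markup are assumptions made for this example; the real methods may take different parameters and emit different markup.

```php
<?php
// Hypothetical helpers mirroring the method descriptions; not the library's code.

// Convert a DOMNamedNodeMap of DOMAttr objects into name => value pairs.
function sketchAttrToAssoc(?DOMNamedNodeMap $map): array
{
    $assoc = [];
    if ($map === null) {
        return $assoc; // non-element nodes have no attribute map
    }
    foreach ($map as $attr) {
        $assoc[$attr->name] = $attr->value;
    }
    return $assoc;
}

// Wrap a fragment in a minimal document so the DOM parser sees full HTML.
// Assumption: a charset <meta> hint and a single enclosing <div>.
function sketchWrapHtml(string $fragment, string $charset = 'utf-8'): string
{
    return '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '" />'
        . '</head><body><div>' . $fragment . '</div></body></html>';
}

// Usage: parse a wrapped fragment and read one element's attributes.
$doc = new DOMDocument();
$doc->loadHTML(sketchWrapHtml('<a href="/x" title="T">link</a>'));
$anchor = $doc->getElementsByTagName('a')->item(0);
$attrs = sketchAttrToAssoc($anchor->attributes);
```

Note that a fragment containing a stray closing div tag would escape a wrapper like this one, so a sketch of this kind offers no protection against that case.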

public

__construct()
callbackArmorCommentEntities() — Callback function that entity-izes ampersands in comments so that callbackUndoCommentSubst doesn't clobber them
callbackUndoCommentSubst() — Callback function for undoing escaping of stray angled brackets in comments
muteErrorHandler() — An error handler that mutes all errors
tokenizeHTML()
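
The error-muting handler and the two comment callbacks can be pictured along these lines. This is a sketch under stated assumptions: every function name here is hypothetical, the regular expression for matching comments is an assumption, and the library's actual callbacks may differ in detail.

```php
<?php
// Illustrative sketch only; names and the comment regex are assumptions.

// An error handler that swallows everything: returning true tells PHP the
// error has been handled, so nothing is reported.
function sketchMuteErrorHandler(int $errno, string $errstr): bool
{
    return true;
}

// Load HTML while parser warnings are muted, then restore the old handler.
function sketchLoadQuietly(string $html): DOMDocument
{
    $doc = new DOMDocument();
    set_error_handler('sketchMuteErrorHandler');
    try {
        $doc->loadHTML($html);
    } finally {
        restore_error_handler();
    }
    return $doc;
}

// Entity-ize ampersands inside comments so a later undo step cannot
// mistake pre-existing entities for ones it introduced.
function sketchArmorComments(string $html): string
{
    return preg_replace_callback('/<!--(.*?)(-->|\z)/s', function (array $m): string {
        return '<!--' . str_replace('&', '&amp;', $m[1]) . $m[2];
    }, $html);
}

// Undo the escaping of ampersands and stray angle brackets inside comments.
function sketchUndoCommentSubst(string $html): string
{
    return preg_replace_callback('/<!--(.*?)(-->|\z)/s', function (array $m): string {
        return '<!--' . strtr($m[1], ['&amp;' => '&', '&lt;' => '<']) . $m[2];
    }, $html);
}
```

Because strtr() with an array does not rescan its own output, an armored sequence such as &amp;lt; comes back as the literal text &lt; rather than being unescaped twice.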

Inherited from HTMLPurifier_Lexer

protected

CDATACallback() — Callback function for escapeCDATA() that does the work.
escapeCDATA() — Translates CDATA sections into regular sections (through escaping).
escapeCommentedCDATA() — Special CDATA case that is especially convoluted for <script>
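
A minimal sketch of CDATA escaping, assuming a regular-expression match handed to a callback that HTML-escapes the section body. The names sketchEscapeCdata and sketchCdataCallback are hypothetical, and the pattern is an assumption rather than the library's confirmed expression.

```php
<?php
// Illustrative sketch only; not HTML Purifier's actual implementation.

// Callback that does the work: escape the CDATA body so it reads as text.
function sketchCdataCallback(array $matches): string
{
    return htmlspecialchars($matches[1], ENT_COMPAT, 'UTF-8');
}

// Replace each <![CDATA[...]]> section with its escaped contents.
function sketchEscapeCdata(string $html): string
{
    return preg_replace_callback(
        '/<!\[CDATA\[(.+?)\]\]>/s',
        'sketchCdataCallback',
        $html
    );
}
```

With this sketch, the markers are dropped and the section body survives as ordinary escaped text, so a parser that does not understand CDATA never sees raw markup from inside it.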

public

create() — Retrieves or sets the default Lexer as a Prototype Factory.
extractBody() — Takes a string of HTML (fragment or document) and returns the content
normalize() — Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
parseData() — Parses special entities into the proper characters.
tokenizeHTML() — Lexes an HTML string into tokens.
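
To make parseData() and extractBody() more concrete, here is a hedged sketch. The function names, the entity table contents, and the body-matching pattern are all assumptions for illustration; the inherited methods may handle more cases than shown.

```php
<?php
// Illustrative sketch only; table contents and regex are assumptions.

// Convert the most common special entities back into raw characters.
function sketchParseData(string $text): string
{
    $table = [
        '&quot;' => '"',
        '&amp;'  => '&',
        '&lt;'   => '<',
        '&gt;'   => '>',
        '&#39;'  => "'",
    ];
    // strtr() replaces each entity once and never rescans its output.
    return strtr($text, $table);
}

// Return the contents of <body> if present, otherwise the input unchanged.
function sketchExtractBody(string $html): string
{
    if (preg_match('~<body[^>]*>(.*)</body>~is', $html, $m)) {
        return $m[1];
    }
    return $html; // already a fragment
}
```

A fragment without a body element passes through sketchExtractBody() untouched, which is what lets the same entry point accept both full documents and fragments.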