\HTMLPurifier_Lexer_DOMLex
Parser that uses PHP 5's DOM extension (part of the core).
In PHP 5, the DOM XML extension was revamped into DOM and added to the core.
It gives us a forgiving HTML parser, which we use to transform the HTML
into a DOM, and then into tokens. It is blazingly fast (for large
documents, it runs twenty times faster than
HTMLPurifier_Lexer_DirectLex), and is the default choice for PHP 5.
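The HTML-to-DOM-to-tokens flow described above can be sketched with nothing but PHP's DOM extension. This is a simplified illustration, not HTML Purifier's implementation: the function names sketchTokenize() and sketchLex() and the array-based token shapes are hypothetical, and the sketch assumes PHP 7.1 or later for the type declarations.

```php
<?php
// Illustrative only: a simplified DOM-to-token walk, not HTML Purifier's code.

/**
 * Recursively flattens a DOM node into a list of simple token arrays.
 * The token shapes ('start', 'end', 'text') are hypothetical.
 */
function sketchTokenize(DOMNode $node, array &$tokens): void
{
    if ($node instanceof DOMText) {
        $tokens[] = ['text', $node->data];
        return;
    }
    if (!($node instanceof DOMElement)) {
        return; // comments and other node types are skipped in this sketch
    }
    $tokens[] = ['start', $node->tagName];
    foreach ($node->childNodes as $child) {
        sketchTokenize($child, $tokens);
    }
    $tokens[] = ['end', $node->tagName];
}

/**
 * Parses an HTML fragment with the forgiving DOM parser and returns tokens.
 */
function sketchLex(string $fragment): array
{
    // Wrap the fragment so the parser sees a complete UTF-8 document.
    $html = '<html><head><meta http-equiv="Content-Type" '
          . 'content="text/html; charset=utf-8"></head>'
          . '<body><div>' . $fragment . '</div></body></html>';

    $doc = new DOMDocument();
    // Malformed markup raises libxml warnings; collect them internally instead.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tokens = [];
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    foreach ($wrapper->childNodes as $child) {
        sketchTokenize($child, $tokens);
    }
    return $tokens;
}

// Deliberately malformed input: the parser repairs the nesting for us.
$tokens = sketchLex('<b>Bold <i>and italic</b> text');
```

Because the walk emits an end token for every start token, the resulting stream is always balanced, whatever the parser had to repair in the input.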
Synopsis
class HTMLPurifier_Lexer_DOMLex
extends HTMLPurifier_Lexer
{
- // members
- private $factory;
- // Inherited members from HTMLPurifier_Lexer
- public boolean $tracksLineNumbers;
- protected array $_special_entity2str;
- // methods
- public void __construct()
- public void tokenizeHTML()
- protected void tokenizeDOM()
- protected void transformAttrToAssoc()
- public void muteErrorHandler()
- public void callbackUndoCommentSubst()
- public void callbackArmorCommentEntities()
- protected void wrapHTML()
- // Inherited methods from HTMLPurifier_Lexer
- public static Concrete create()
- public void __construct()
- public void parseData()
- public HTMLPurifier_Token tokenizeHTML()
- protected static void escapeCDATA()
- protected static void escapeCommentedCDATA()
- protected static void CDATACallback()
- public void normalize()
- public void extractBody()
Hierarchy
Extends
- HTMLPurifier_Lexer
Tasks
| Line | Task |
|---|---|
| 252+ | Consider making protected |
| 286+ | Consider making protected |
Members
private
- $factory
protected
- $_special_entity2str
  Most common entity to raw value conversion table for special entities.
public
- $tracksLineNumbers
  Whether or not this lexer implements line-number/column-number tracking.
Methods
protected
- tokenizeDOM() — Recursive function that tokenizes a node, putting it into an accumulator.
- transformAttrToAssoc() — Converts a DOMNamedNodeMap of DOMAttr objects into an assoc array.
- wrapHTML() — Wraps an HTML fragment in the necessary HTML
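As a rough illustration of what converting a DOMNamedNodeMap of DOMAttr objects into an assoc array involves, here is a minimal sketch. The helper name sketchAttrToAssoc() is hypothetical, and the real transformAttrToAssoc() may differ in detail.

```php
<?php
// Illustrative only: not the library's transformAttrToAssoc().

/**
 * Copies each attribute's name and value into a plain associative array.
 */
function sketchAttrToAssoc(DOMNamedNodeMap $map): array
{
    $assoc = [];
    foreach ($map as $attr) {
        // Each item is a DOMAttr exposing ->name and ->value.
        $assoc[$attr->name] = $attr->value;
    }
    return $assoc;
}

$doc = new DOMDocument();
$doc->loadXML('<a href="http://example.com/" title="Example">link</a>');
$attrs = sketchAttrToAssoc($doc->documentElement->attributes);
// $attrs now maps 'href' and 'title' to their string values.
```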
public
- __construct()
- callbackArmorCommentEntities() — Callback function that entity-izes ampersands in comments so that callbackUndoCommentSubst doesn't clobber them
- callbackUndoCommentSubst() — Callback function for undoing escaping of stray angled brackets in comments
- muteErrorHandler() — An error handler that mutes all errors
- tokenizeHTML()
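An error handler that mutes all errors is useful because the DOM parser reports malformed markup as PHP warnings. The sketch below shows the general pattern of installing such a handler around DOMDocument::loadHTML(). The names sketchMuteErrorHandler() and sketchQuietLoad() are hypothetical, and this is an assumption about how such a handler is typically used rather than a copy of the class's code.

```php
<?php
// Illustrative only: the install-parse-restore pattern for a muting handler.

/**
 * Swallows every error it receives.
 */
function sketchMuteErrorHandler(int $errno, string $errstr): bool
{
    return true; // true tells PHP the error has been handled
}

/**
 * Parses possibly malformed HTML without emitting parser warnings.
 */
function sketchQuietLoad(string $html): DOMDocument
{
    $doc = new DOMDocument();
    set_error_handler('sketchMuteErrorHandler');
    try {
        $doc->loadHTML($html);
    } finally {
        // Always put the previous handler back, even if parsing throws.
        restore_error_handler();
    }
    return $doc;
}

$doc = sketchQuietLoad('<html><body><p>Unclosed <b>tags<p>Second</body></html>');
```

Restoring the previous handler in a finally block keeps the muting scoped to the parse call, so errors elsewhere in the application are still reported.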
Inherited from HTMLPurifier_Lexer
protected
- CDATACallback() — Callback function for escapeCDATA() that does the work.
- escapeCDATA() — Translates CDATA sections into regular sections (through escaping).
- escapeCommentedCDATA() — Special CDATA case that is especially convoluted for <script>
public
- create() — Retrieves or sets the default Lexer as a Prototype Factory.
- extractBody() — Takes a string of HTML (fragment or document) and returns the content
- normalize() — Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
- parseData() — Parses special entities into the proper characters.
- tokenizeHTML() — Lexes an HTML string into tokens.
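Translating CDATA sections into regular sections through escaping can be pictured as a regular-expression callback that replaces each section with its entity-escaped contents. The following is a minimal sketch under that assumption. The function name sketchEscapeCdata() is hypothetical, and the inherited escapeCDATA() and CDATACallback() may handle more cases.

```php
<?php
// Illustrative only: CDATA sections become ordinary escaped text.

/**
 * Replaces each <![CDATA[...]]> section with its HTML-escaped contents.
 */
function sketchEscapeCdata(string $html): string
{
    $result = preg_replace_callback(
        '/<!\[CDATA\[(.+?)\]\]>/s',
        function (array $matches): string {
            // Escape the raw contents so they survive as plain text.
            return htmlspecialchars($matches[1], ENT_COMPAT, 'UTF-8');
        },
        $html
    );
    // preg_replace_callback() returns null on failure; fall back to the input.
    return $result ?? $html;
}

$out = sketchEscapeCdata('<p><![CDATA[1 < 2 & 3 > 2]]></p>');
// $out is '<p>1 &lt; 2 &amp; 3 &gt; 2</p>'
```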