\HTMLPurifier_Lexer_PEARSax3
Proof-of-concept lexer that uses the PEAR package XML_HTMLSax3 to parse HTML.
PEAR, not surprisingly, also has a SAX parser for HTML. I don't know
very much about its implementation, but it's fairly well written. However, that
abstraction comes at a price: performance. You need to have the package installed,
and if its API changes, it might break our adapter. I'm not sure whether or not
it's UTF-8 aware, but it does have some entity parsing trouble in all areas
(text and attributes).
Personally, I don't recommend using the PEAR class, and the defaults
don't use it. The unit tests do run against the SAX parser too, but
whatever it does with poorly formed HTML is up to it.
Synopsis
class HTMLPurifier_Lexer_PEARSax3
extends HTMLPurifier_Lexer
{
- // members
- protected array $tokens = array();
- protected $last_token_was_empty;
- private $parent_handler;
- private array $stack = array();
- // Inherited members from HTMLPurifier_Lexer
- public boolean $tracksLineNumbers;
- protected array $_special_entity2str;
- // methods
- public void tokenizeHTML()
- public void openHandler()
- public void closeHandler()
- public void dataHandler()
- public void escapeHandler()
- public void muteStrictErrorHandler()
- // Inherited methods from HTMLPurifier_Lexer
- public static Concrete create()
- public void __construct()
- public void parseData()
- public HTMLPurifier_Token tokenizeHTML()
- protected static void escapeCDATA()
- protected static void escapeCommentedCDATA()
- protected static void removeIEConditional()
- protected static void CDATACallback()
- public void normalize()
- public void extractBody()
}
Hierarchy
Extends
- HTMLPurifier_Lexer
Tasks
Line | Task |
---|---|
22+ | Generalize so that XML_HTMLSax is also supported. |
263+ | Consider making protected |
314+ | Consider making protected |
Members
private
- $parent_handler
- $stack
protected
- $_special_entity2str
  Most common entity to raw value conversion table for special entities.
- $last_token_was_empty
- $tokens
  Internal accumulator array for SAX parsers.
public
- $tracksLineNumbers
  Whether or not this lexer implements line-number/column-number tracking.
Methods
public
- closeHandler() — Close tag event handler, interface is defined by PEAR package.
- dataHandler() — Data event handler, interface is defined by PEAR package.
- escapeHandler() — Escaped text handler, interface is defined by PEAR package.
- muteStrictErrorHandler() — An error handler that mutes strict errors
- openHandler() — Open tag event handler, interface is defined by PEAR package.
- tokenizeHTML()
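
The handler methods above follow the usual SAX pattern: the parser fires one event per tag or text run, and each handler appends a token to an accumulator. The sketch below illustrates that pattern only. The class `SketchSaxAccumulator`, its array-based tokens, and the simplified handler signatures are hypothetical stand-ins, not the actual HTMLPurifier_Lexer_PEARSax3 code or the XML_HTMLSax3 interface, and the events are fired by hand rather than by a real parser.

```php
<?php
// Hypothetical illustration of SAX-style token accumulation.
// Not HTML Purifier code; names and signatures are assumptions.
final class SketchSaxAccumulator
{
    /** @var array[] Tokens collected so far, in document order. */
    public $tokens = array();

    /** @var string[] Names of the elements that are currently open. */
    private $stack = array();

    // Open tag event: record a start token, or an empty token for
    // self-closing tags such as <br />.
    public function openHandler($name, array $attrs, $closed)
    {
        $this->tokens[] = array(
            'type' => $closed ? 'empty' : 'start',
            'name' => $name,
            'attr' => $attrs,
        );
        if (!$closed) {
            $this->stack[] = $name;
        }
        return true;
    }

    // Close tag event: record an end token and pop the open-element stack.
    public function closeHandler($name)
    {
        $this->tokens[] = array('type' => 'end', 'name' => $name);
        array_pop($this->stack);
        return true;
    }

    // Data event: record a text token.
    public function dataHandler($data)
    {
        $this->tokens[] = array('type' => 'text', 'data' => $data);
        return true;
    }

    // Names of elements opened but not yet closed.
    public function openElements()
    {
        return $this->stack;
    }
}

// Simulate the events a SAX parser might fire for: <p class="x">Hi<br /></p>
$acc = new SketchSaxAccumulator();
$acc->openHandler('p', array('class' => 'x'), false);
$acc->dataHandler('Hi');
$acc->openHandler('br', array(), true);
$acc->closeHandler('p');

// Print one line per token: its type, plus the tag name when it has one.
foreach ($acc->tokens as $token) {
    echo $token['type'], isset($token['name']) ? ' ' . $token['name'] : '', "\n";
}
```

Keeping the accumulator and the open-element stack as separate members mirrors the `$tokens` and `$stack` members listed above, though what the real class does with its stack is not described in this page.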
Inherited from HTMLPurifier_Lexer
protected
- CDATACallback() — Callback function for escapeCDATA() that does the work.
- escapeCDATA() — Translates CDATA sections into regular sections (through escaping).
- escapeCommentedCDATA() — Special CDATA case that is especially convoluted for <script>
- removeIEConditional() — Special Internet Explorer conditional comments should be removed.
public
- create() — Retrieves or sets the default Lexer as a Prototype Factory.
- extractBody() — Takes a string of HTML (fragment or document) and returns the content
- normalize() — Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
- parseData() — Parses special entities into the proper characters.
- tokenizeHTML() — Lexes an HTML string into tokens.