$Id: Readme,v 1.4 2003/10/08 10:15:40 harryf Exp $

++Introduction

XML_HTMLSax is a SAX-based XML parser for badly formed XML documents, such as HTML.

The original code base was developed by Alexander Zhukov and published at http://sourceforge.net/projects/phpshelve/. Alexander kindly gave permission to modify the code and license for inclusion in PEAR.

PEAR::XML_HTMLSax provides an API very similar to the native PHP Expat extension, so handlers written for one can easily be adapted to the other. The key difference is that HTMLSax will not break on badly formed XML, which allows it to be used for parsing HTML documents. Otherwise HTMLSax supports all the handlers available from Expat except the namespace and external entity handlers. It also provides methods for handling XML escapes as well as JSP/ASP opening and closing tags.

Version 2 has had its internals completely overhauled to use a Lexer, delivering performance *approaching* that of the native XML extension, as well as a radically improved, modular design that makes adding further functionality easy.

The public API remains the same as in older versions, except for the set_option() method. The available options have been renamed, and additional options allow HTMLSax to behave almost exactly like the native Expat extension: contents of XML elements which contain linefeeds, tabs and XML entities trigger additional calls to the registered data handler, as defined by XML_HTMLSax::set_data_handler().

A big thanks to Jeff Moore (lead developer of WACT: http://wact.sourceforge.net), who is largely responsible for the new design, and to the members of Sitepoint's Advanced PHP forums for their input: http://www.sitepointforums.com/showthread.php?threadid=121246. Thanks also to Marcus Baker (lead developer of SimpleTest: http://www.lastcraft.com/simple_test.php) for sorting out the unit tests.
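For comparison, here is a minimal sketch of the native Expat handler style that HTMLSax's API is modelled on. It uses only PHP's built-in XML extension, not HTMLSax itself. The handler names, the $events array and the parseWithExpat() helper are hypothetical and exist only for this illustration. It also shows the behavior HTMLSax is meant to avoid: Expat gives up on markup that is not well formed.

```php
<?php
// Collects SAX events so they can be inspected after parsing.
$events = array();

// Called by Expat for every opening tag.
function openHandler($parser, $name, $attrs) {
    global $events;
    $events[] = "open:$name";
}

// Called by Expat for every closing tag.
function closeHandler($parser, $name) {
    global $events;
    $events[] = "close:$name";
}

// Called by Expat for character data; may fire more than once per text node.
function dataHandler($parser, $data) {
    global $events;
    $events[] = "data:$data";
}

// Hypothetical helper: feeds $markup to Expat in one chunk and
// returns true if the parser accepted it, false otherwise.
function parseWithExpat($markup) {
    $parser = xml_parser_create();
    xml_set_element_handler($parser, 'openHandler', 'closeHandler');
    xml_set_character_data_handler($parser, 'dataHandler');
    // xml_parse() returns 1 on success and 0 on failure.
    return xml_parse($parser, $markup, true) === 1;
}

// Well-formed input is accepted.
$wellFormed = parseWithExpat('<p>Hello <b>world</b></p>');

// HTML-style input with an unclosed <br> is rejected by Expat.
$badlyFormed = parseWithExpat('<p>Hello <br></p>');
```

Handlers written in this style should need little change to work with HTMLSax, which registers them through its own methods such as XML_HTMLSax::set_data_handler().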
++Uses

Some particular situations where XML_HTMLSax can be useful include:

- Parsing XML documents (such as those online) where the source is out of your control and Expat is choking because the document is badly formed.
- Converting HTML to XHTML.
- Filtering form posts to allow a limited HTML subset (probably with help from PEAR::XML_SaxFilters).
- Reading HTML-based content from a database and converting it to PDF (with help from a PDF generation library and probably PEAR::XML_SaxFilters as well).
- Parsing ASP(.NET) and JSP pages.
- Creating a PHP-GTK based web browser? A PHP CSS Parser exists: http://www.phpclasses.org/browse.html/package/1081.html

++Features

- Won't "break" on badly formed XML. May in some instances get it "wrong" (see Limitations) but will continue parsing.
- Provides an API similar to the native PHP XML Expat extension.
- Can be instructed to behave in more or less the same manner as Expat when dealing with linefeeds, tabs and XML entities.
- In addition to handling basic XML elements, attributes and data, it is also capable of dealing with:
  - Processing instructions (PIs). Within PIs, XML entities are not parsed (i.e. < and > are ignored).
  - XML escape markup. Within this markup, XML entities are not parsed (useful for JavaScript, for example).
  - JSP / ASP (JASP) marked up with <% %>. Note: you will need to deal with <%@ %> and <%= %> yourself. Within JASP markup, XML entities are not parsed.

++Usage Notes

- Performance-wise, it runs faster on PHP 4.3.0 and later, thanks to strspn() and strcspn() supporting position arguments. For older PHP versions, while loops are used to achieve the same effect, meaning a slightly higher overhead. Note also that setting XML options with XML_HTMLSax::set_option() slows down the parser, because the options are handled by "decorators" which perform further formatting on XML events that have already been parsed.
- By default, no parser options are set.
- The XML_OPTION_ENTITIES_PARSED option uses the html_entity_decode() function, which is only available in PHP 4.3.0+. To get round this, HTMLSax checks your PHP version and whether a function named html_entity_decode exists. If it is not found, HTMLSax defines a function which mirrors the behavior of the native PHP html_entity_decode(). Both XML_OPTION_ENTITIES_PARSED and XML_OPTION_ENTITIES_UNPARSED can be used down to PHP version 4.0.5, due to the regular expression used to find entities.
- For attributes which have just a name but no value e.g.