Using XSLT with Bad HTML

19 Oct 2004 | Tags php xslt programming

We have a PHP CMS with a lot of poorly written HTML in the client-contributed content. This kept causing my XSL template system to output XML errors. I got around this problem by:

  1. Wrapping content in CDATA sections
  2. Checking whether the content is well-formed XML with xml_parse() in PHP; if it is not, I wrap it in a CDATA section and try again
  3. Stripping out bad characters that may have crept in from Word
  4. Processing the XSL and XML using xsl:value-of tags with disable-output-escaping="yes"
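
The check-then-wrap part of the steps above might look roughly like the following. This is a minimal sketch, not the CMS's actual code: the function names (is_well_formed_fragment, wrap_in_cdata, prepare_content) are hypothetical, and it assumes PHP's xml extension is available.

```php
<?php
// Sketch only: these function names are hypothetical, not part of any real CMS API.

// Returns true if $html parses as well-formed XML once it has a single root element.
function is_well_formed_fragment($html)
{
    $parser = xml_parser_create();
    // A dummy root is added because a content fragment may have several top-level nodes.
    $ok = xml_parse($parser, '<root>' . $html . '</root>', true);
    return $ok === 1;
}

// Wraps $html in a CDATA section. A literal "]]>" inside the content would end
// the section early, so it is split across two adjacent CDATA sections.
function wrap_in_cdata($html)
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $html) . ']]>';
}

// Passes well-formed content through untouched and wraps everything else.
function prepare_content($html)
{
    return is_well_formed_fragment($html) ? $html : wrap_in_cdata($html);
}
```

Here the CDATA wrapper is applied only when the parser rejects the content, so clean markup still reaches the stylesheet as real elements. Wrapping everything unconditionally would also work and is simpler, at the cost of treating all content as opaque text.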

Wrapping unpredictable HTML in CDATA sections keeps the XML parser from trying to interpret it as markup. Without the final step, the XSLT processor escapes that content on output, so the resulting page shows the original HTML as literal text (&lt;p&gt; instead of <p>).
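
To see the effect of disable-output-escaping, here is a small sketch that runs the same CDATA-wrapped content through a stylesheet with and without the attribute. It assumes PHP 5 or later with the dom and xsl extensions (DOMDocument and XSLTProcessor), which may not be the XSLT API the original system used. The element names page and body and the function render_body are made up for illustration.

```php
<?php
// Sketch only: assumes the dom and xsl extensions; all names here are hypothetical.

function render_body($rawHtml, $disableEscaping)
{
    // Build <page><body><![CDATA[...]]></body></page> around the untrusted HTML.
    $xmlDoc = new DOMDocument('1.0', 'UTF-8');
    $page = $xmlDoc->appendChild($xmlDoc->createElement('page'));
    $body = $page->appendChild($xmlDoc->createElement('body'));
    $body->appendChild($xmlDoc->createCDATASection($rawHtml));

    // The only difference between the two runs is this one attribute.
    $attr = $disableEscaping ? ' disable-output-escaping="yes"' : '';
    $xsl = '<?xml version="1.0" encoding="UTF-8"?>'
         . '<xsl:stylesheet version="1.0"'
         . ' xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
         . '<xsl:output method="html"/>'
         . '<xsl:template match="/page">'
         . '<div><xsl:value-of select="body"' . $attr . '/></div>'
         . '</xsl:template>'
         . '</xsl:stylesheet>';

    $xslDoc = new DOMDocument();
    $xslDoc->loadXML($xsl);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($xslDoc);
    return $processor->transformToXML($xmlDoc);
}

$escaped = render_body('<p>Hello <b>world</p>', false); // markup arrives as entities
$raw     = render_body('<p>Hello <b>world</p>', true);  // markup is passed through as-is
```

Because the content is copied to the output unescaped, whatever broken markup the client wrote ends up in the page verbatim; the technique keeps the XML pipeline running but does not repair the HTML.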

In PHP, mb_convert_encoding($string, 'ASCII') has proven very useful for handling text that users paste from applications like Word. PHP has to be compiled with --enable-mbstring for this function to work. It prevents strings whose encoding differs from the one declared in your XSL from confusing the XML parser.
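
As a rough illustration of that call, the sketch below converts a string containing Word-style curly quotes to ASCII. The helper name to_ascii is hypothetical, and the sample assumes the pasted text arrives as UTF-8. The post's two-argument call instead relies on mbstring's internal encoding as the source, so the third argument here is an assumption about the input.

```php
<?php
// Sketch only: to_ascii() is a hypothetical helper, and UTF-8 input is an assumption.

function to_ascii($text, $fromEncoding = 'UTF-8')
{
    // mb_convert_encoding() exists only when PHP is built with mbstring support.
    if (!extension_loaded('mbstring')) {
        throw new RuntimeException('The mbstring extension is not available.');
    }
    // Characters with no ASCII equivalent are replaced by mbstring's
    // substitute character, which is "?" unless configured otherwise.
    return mb_convert_encoding($text, 'ASCII', $fromEncoding);
}

// Curly double quotes (U+201C and U+201D) as Word would paste them, in UTF-8.
$pasted = "\xE2\x80\x9CHello\xE2\x80\x9D from Word";
$clean  = to_ascii($pasted);
```

The conversion is lossy: curly quotes, long dashes, and accented letters are all replaced rather than transliterated, which is acceptable when the goal is simply to keep the parser from failing.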