Parse XHTML5 into XDocument

Question

I need to parse XHTML5 files into XDocument instances. My files will always be well-formed XML, so I want to avoid HtmlAgilityPack because it is too permissive toward malformed XHTML. The XDocument.Load method works for simple cases, but fails when the document contains named character references (entities):

var xhtml = XDocument.Load(reader);
// XmlException: Reference to undeclared entity 'nbsp'.

For XHTML 1.0, this issue could be resolved by using an XmlPreloadedResolver, which preloads the well-known DTDs that are defined in XHTML 1.0. The approach can be extended to support XHTML 1.1 by manually providing its DTD, as shown in this answer.
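
For reference, a minimal sketch of the XHTML 1.0 approach described above. It assumes the input carries one of the standard XHTML 1.0 document type declarations; the class and method names are made up for illustration.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Resolvers;

static class Xhtml10Loader
{
    // Hypothetical helper: loads an XHTML 1.0 document, taking the DTD
    // (and with it the named entities) from the copies shipped with the
    // framework instead of fetching them from w3.org.
    public static XDocument Load(TextReader source)
    {
        var settings = new XmlReaderSettings
        {
            // DTD processing is prohibited by default and must be enabled.
            DtdProcessing = DtdProcessing.Parse,
            // Serves the well-known XHTML 1.0 DTDs from memory.
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10)
        };

        using (var reader = XmlReader.Create(source, settings))
        {
            return XDocument.Load(reader);
        }
    }
}
```

This only helps when the document type declaration points at a DTD the resolver knows about, which is exactly what XHTML5 lacks.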

However, XHTML5 does not have a DTD, as discussed under this other answer. Its entity definitions are provided for informational purposes as JSON. Its document type declaration names no DTD at all:

<!DOCTYPE html>

Consequently, the XmlResolver methods are never called when parsing entities in XHTML5. There is a discussion of attempts to provide XmlReader with a list of entity declarations, but no approach seems to work out of the box.

Currently, there are two approaches I'm looking at. The first is specifying an internal subset with the entity declarations in the document type declaration, either through string manipulation on the source XHTML, or through XmlParserContext.InternalSubset. This would result in a document type declaration similar to:

<!DOCTYPE html [
  <!ENTITY ndash "&#8211;">
  <!ENTITY nbsp "&#160;">
  ...
]>

It seems like this is allowed in XHTML5; however, it is undesirable since it litters the XDocument with the entity declarations (of which there are now more than 2000), which will be problematic if the user converts it back to a string representation.
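
A rough sketch of the string-manipulation variant of this first approach. It assumes the source contains the declaration spelled exactly as `<!DOCTYPE html>`, and that the caller supplies the entity table (name mapped to a replacement such as `&#160;`). The class and method names are hypothetical. Clearing `InternalSubset` after loading is my own assumption about how the littering might be limited, not something I have verified on real documents.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class Xhtml5SubsetLoader
{
    const string Doctype = "<!DOCTYPE html>";

    // Hypothetical helper: injects entity declarations into the document
    // type declaration, parses, then drops the declarations again.
    public static XDocument Load(string xhtml, IDictionary<string, string> entities)
    {
        // Build the internal subset from the supplied (name, replacement) pairs.
        var subset = new StringBuilder();
        foreach (var entity in entities)
            subset.AppendFormat("<!ENTITY {0} \"{1}\">", entity.Key, entity.Value);

        // Assumption: the declaration appears verbatim, exactly as in Doctype.
        int index = xhtml.IndexOf(Doctype, StringComparison.Ordinal);
        if (index < 0)
            throw new ArgumentException("Expected a plain <!DOCTYPE html> declaration.", nameof(xhtml));

        string patched = xhtml.Substring(0, index)
            + "<!DOCTYPE html [" + subset + "]>"
            + xhtml.Substring(index + Doctype.Length);

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // required so the declarations are read
            XmlResolver = null                   // nothing external needs to be fetched
        };

        using (var reader = XmlReader.Create(new StringReader(patched), settings))
        {
            XDocument document = XDocument.Load(reader);

            // The reader has already expanded the references into text,
            // so the declarations are no longer needed in the tree.
            if (document.DocumentType != null)
                document.DocumentType.InternalSubset = null;

            return document;
        }
    }
}
```

Called with a table such as `{ "nbsp", "&#160;" }`, the resulting XDocument holds the actual characters, and the document type should serialize without the declarations.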

My other approach is to preprocess the XHTML string using regex to convert all the named character references into numeric character references (or into the actual Unicode characters), excluding the XML predefined entities (&quot; &amp; &apos; &lt; &gt;). However, I'm concerned that there are complexities in the definition of XML that this approach might miss. For example, this answer indicates that characters must not be escaped in comments, CDATA sections, or processing instructions. I assume that my regex would need to be tweaked to exclude all these occurrences.
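
One way the regex could be made to skip those constructs is to let the pattern match comments, CDATA sections, and processing instructions as whole units and return them unchanged. The sketch below assumes a two-entry table standing in for the full list; the class name and the table are hypothetical, and an internal DTD subset is not handled.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EntityPreprocessor
{
    // Hypothetical sample table; a real one would be generated from the
    // published entity list. The predefined XML entities are deliberately
    // absent, so they pass through untouched.
    static readonly Dictionary<string, string> Named = new Dictionary<string, string>
    {
        { "nbsp", "&#160;" },
        { "ndash", "&#8211;" }
    };

    // Alternatives are tried left to right at each position, so a comment,
    // CDATA section, or processing instruction is consumed whole before any
    // '&' inside it can be reached by the last alternative.
    static readonly Regex Pattern = new Regex(
        @"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>|&([A-Za-z][A-Za-z0-9]*);",
        RegexOptions.Singleline);

    public static string Convert(string xhtml)
    {
        return Pattern.Replace(xhtml, match =>
        {
            // Group 1 only participates for a named reference.
            if (!match.Groups[1].Success)
                return match.Value;

            string replacement;
            return Named.TryGetValue(match.Groups[1].Value, out replacement)
                ? replacement
                : match.Value; // unknown or predefined name: leave as is
        });
    }
}
```

For example, `EntityPreprocessor.Convert("<p>a&nbsp;b<!-- &nbsp; --></p>")` returns `<p>a&#160;b<!-- &nbsp; --></p>`. Names missing from the table are left alone, so the parser still reports them as undeclared.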

Does anyone have experience or recommendations on the two approaches, or any other approach you'd consider? I would prefer approaches that build on XmlReader's extensibility, but will resort to source string manipulation if there is no other way.

The [last comment](http://stackoverflow.com/questions/3215053/xhtml5-and-html4-character-entities/3215289#comment48224432_3215289) seems to have updated information regarding an official list of XHTML5 character entities *as DTD*. I haven't done this before, but if I understand your explanation correctly, given the DTD you can now use the `XmlPreloadedResolver` approach... — har07, Feb 20 '16 at 09:57
I haven't tried, but how about using the XHTML1.1 doctype declaration (that includes the DTD), and then defining the new HTML5 elements with markup declarations such as `<!ELEMENT section ( #PCDATA | %Flow.mix; )*>`. — Mr Lister, Feb 20 '16 at 14:01
This may sound like as much a kludge as manually defining all new entities, but there are far fewer than 2000 new elements in HTML5 compared to XHTML1.1! — Mr Lister, Feb 20 '16 at 14:02
I can test later but perhaps using doctype with entity map and then transform first with the identity transform. I believe this would do the substitutions for you much like trying to write some huge regex — Kevin Brown, Feb 20 '16 at 17:34

score 1 · Answer 1 · answered Feb 20 '16 at 19:22

If you apply the identity transform to your source document with the entity map in place, it substitutes the actual characters for you in the result. To me, this is no more work than the regex (it is still a single step) and certainly much less complex.

Given this source:

<!DOCTYPE foo [
 <!ENTITY ndash "&#8211;">
 <!ENTITY nbsp "&#160;">
]>
<foo>
  <p>I am &ndash; and I am&nbsp;non-breaking space.</p>
</foo>

And this transform:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

You would have this result as your new input:

<foo>
   <p>I am – and I am non-breaking space.</p>
</foo>

Further, you could just keep all those definitions in a separate file and add one reference to them like this:

<!DOCTYPE foo [
  <!ENTITY % winansi SYSTEM "path/to/my/map/winansi.xml">
  %winansi;
]>
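
Since the question is about .NET, here is a sketch of how the identity transform above might be run with XslCompiledTransform. I have not tested it against large entity maps. The class name is made up. Because the resolver is set to null, it only covers entity declarations written inline in the document type declaration; an external map file like the one above would need a resolver that is allowed to load it.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class IdentityTransformer
{
    // The identity transform shown above, embedded as a string.
    const string IdentityXslt =
        @"<xsl:stylesheet xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" version=""1.0"">
  <xsl:template match=""@*|node()"">
    <xsl:copy>
      <xsl:apply-templates select=""@*|node()""/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical helper: returns the input with entity references
    // replaced by the characters they stand for.
    public static string Expand(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var stylesheet = XmlReader.Create(new StringReader(IdentityXslt)))
        {
            transform.Load(stylesheet);
        }

        var inputSettings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // read the inline entity map
            XmlResolver = null                   // do not fetch external resources
        };

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(xml), inputSettings))
        using (var writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

The document type declaration is not part of the data model the transform copies, so the result should contain neither the entity map nor any named references.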
