I need to parse XHTML5 files into XDocument instances. My files will always be well-formed XML, so I want to avoid HtmlAgilityPack, which is too permissive of malformed XHTML. The XDocument.Load method works for simple cases, but breaks when the document contains named character references (entities):
var xhtml = XDocument.Load(reader);
// XmlException: Reference to undeclared entity 'nbsp'.
For XHTML 1.0, this issue could be resolved by using an XmlPreloadedResolver, which preloads the well-known DTDs that are defined in XHTML 1.0. The approach can be extended to support XHTML 1.1 by manually providing its DTD, as shown in this answer.
However, XHTML5 does not have a DTD, as discussed under this other answer. Its entity definitions are provided for informational purposes as JSON. The document type declaration therefore carries no public or system identifier:
<!DOCTYPE html>
Consequently, the XmlResolver methods are never called when parsing entities in XHTML5. There is a discussion of attempts to provide XmlReader with a list of entity declarations, but no approach seems to work out of the box.
Currently, there are two approaches I'm looking at. The first is specifying an internal subset with the entity declarations in the document type declaration, either through string manipulation on the source XHTML, or through XmlParserContext.InternalSubset. This would result in a document type declaration similar to:
<!DOCTYPE html [
<!ENTITY ndash "–">
<!ENTITY nbsp "&#160;">
...
]>
It seems like this is allowed in XHTML5; however, it is undesirable since it litters the XDocument with the entity declarations (of which there are now more than 2000), which will be problematic if the user converts it back to a string representation.
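To make the first approach concrete, here is a minimal sketch of the string-manipulation variant. It assumes the source contains the doctype written exactly as `<!DOCTYPE html>`; the class name `Xhtml5Loader` and the two-entry `SampleEntities` table are hypothetical stand-ins for a full table generated from the JSON entity list. It also clears `XDocumentType.InternalSubset` after loading, which might be one way to keep the declarations out of the serialized output. I have not tried the XmlParserContext variant here.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class Xhtml5Loader
{
    // Hypothetical, illustrative subset; the real list has more than 2000 names,
    // and some names map to more than one code point (not handled here).
    private static readonly (string Name, int CodePoint)[] SampleEntities =
    {
        ("nbsp", 0x00A0),
        ("ndash", 0x2013),
    };

    public static XDocument Load(string xhtml)
    {
        // Build the internal subset: <!ENTITY nbsp "&#160;"><!ENTITY ndash "&#8211;">
        var subset = new StringBuilder();
        foreach (var (name, codePoint) in SampleEntities)
            subset.Append($"<!ENTITY {name} \"&#{codePoint};\">");

        // Assumption: the doctype appears exactly in this form.
        const string doctype = "<!DOCTYPE html>";
        int index = xhtml.IndexOf(doctype, StringComparison.Ordinal);
        if (index < 0)
            throw new ArgumentException("Expected '<!DOCTYPE html>'.", nameof(xhtml));

        string patched = xhtml.Substring(0, index)
            + "<!DOCTYPE html [" + subset + "]>"
            + xhtml.Substring(index + doctype.Length);

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // the default is Prohibit
            XmlResolver = null,                  // there is nothing external to resolve
        };

        using var reader = XmlReader.Create(new StringReader(patched), settings);
        XDocument document = XDocument.Load(reader);

        // Drop the injected declarations so they are not written back out.
        if (document.DocumentType != null)
            document.DocumentType.InternalSubset = null;

        return document;
    }
}
```

Calling `Xhtml5Loader.Load(source)` returns a document whose text nodes already contain the expanded characters, so clearing the internal subset afterwards should not affect the content. Names that XML predefines (such as lt and amp) would need special care if they were ever added to the table.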
My other approach is to preprocess the XHTML string using regex to convert all the named character references into numeric character references (or into the actual Unicode characters), excluding the XML predefined entities: &quot; &amp; &apos; &lt; &gt;. However, I'm concerned that there are complexities in the definition of XML that this approach might miss. For example, this answer indicates that characters must not be escaped in comments, CDATA sections, or processing instructions. I assume that my regex would need to be tweaked to exclude all these occurrences.
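A rough sketch of what I have in mind for the regex approach follows. The pattern matches comments, CDATA sections, and processing instructions as whole units and copies them through untouched, so only references outside of them are rewritten. The class name `NamedReferenceRewriter` and the two-entry `SampleEntities` table are hypothetical; it does not handle references inside a DOCTYPE internal subset, and there may be other cases I have missed.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedReferenceRewriter
{
    // Hypothetical, illustrative subset; a full table would be generated
    // from the JSON entity list. Lookup is case-sensitive, as entity names are.
    private static readonly Dictionary<string, string> SampleEntities =
        new Dictionary<string, string>
        {
            ["nbsp"] = "&#160;",
            ["ndash"] = "&#8211;",
        };

    // Alternatives are tried left to right at each position, so a comment,
    // CDATA section, or processing instruction is consumed whole before the
    // last alternative can see any "&name;" inside it.
    private static readonly Regex Pattern = new Regex(
        @"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>|&(?<name>[A-Za-z][A-Za-z0-9]*);",
        RegexOptions.Singleline | RegexOptions.CultureInvariant);

    public static string Rewrite(string xhtml)
    {
        return Pattern.Replace(xhtml, match =>
        {
            Group name = match.Groups["name"];
            if (!name.Success)
                return match.Value; // comment, CDATA, or PI: copy through unchanged

            // The XML predefined entities (amp, lt, gt, quot, apos) and unknown
            // names are not in the table, so they are left for the XML parser.
            return SampleEntities.TryGetValue(name.Value, out string replacement)
                ? replacement
                : match.Value;
        });
    }
}
```

The output of `NamedReferenceRewriter.Rewrite(source)` can then be passed to XDocument.Parse; unknown names would still fail there with the usual undeclared-entity error, which seems preferable to silently dropping them.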
Does anyone have experience or recommendations on the two approaches, or any other approach you'd consider? I would prefer approaches that build on XmlReader's extensibility, but will resort to source string manipulation if there is no other way.