I am using the XercesLib C++ library to parse an HTML file. In my case, the HTML file may contain angle brackets inside tag content:

<math>
<mo> < <mo>
</math>

XercesLib fails to parse the content of the `mo` tag: it gives me empty output whenever a tag contains unescaped characters.

I cannot ask the source to provide an escaped input file, because the same file is parsed by a JavaScript library (MathJax) without any problem.

How can I fix this problem in XercesLib?

Pavan Tiwari
  • The example shown is not well-formed XML. You need to escape the `<` by writing `&lt;`. See https://en.wikipedia.org/wiki/Well-formed_document and https://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents – Erik Sjölund Aug 05 '17 at 18:27
  • The same XML is supported by another library, so I have to support it in XercesLib as well. – Pavan Tiwari Aug 05 '17 at 18:37

1 Answer

Per the comments, this is simply not valid MathML (or even valid XML).

That MathJax can parse this should be considered luck, not a feature of MathJax. From their docs:

The MathML support is still under active development, so some tags are not yet implemented, and some features are not fully developed, but are coming.

It would be reasonable to believe that some future version of MathJax will no longer accept the MathML example you give, and I doubt that they would ever explicitly support invalid XML.

For the record, MathJax doesn't actually parse the XML; it applies an XSLT transform to it. It's also manipulating the input XML, because if you view the "Original MathML", you get:

<math>
<mo> &lt; <mo>
</mo></mo></math>
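
The kind of repair visible in that output, where the bare `<` becomes `&lt;`, can be approximated on the raw text before it is handed to Xerces. What follows is only a heuristic sketch, not something Xerces or MathJax provides: `escapeStrayLessThan` is a hypothetical helper, and it assumes that a `<` followed by whitespace, a digit, or the end of input is never intended as markup. It does not understand comments or CDATA sections, and it will not catch a case such as `x<y`, where the `<` is immediately followed by a letter.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Xerces API): escape every '<' that cannot
// begin XML markup. A '<' is treated as markup only when the next byte is
// a letter, '_', ':', '/', '!', '?', or a non-ASCII byte (which may be the
// start of a UTF-8 encoded name character).
std::string escapeStrayLessThan(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '<') {
            out += in[i];
            continue;
        }
        // Peek at the byte after '<'; use 0 when '<' is the last byte.
        const unsigned char next =
            (i + 1 < in.size()) ? static_cast<unsigned char>(in[i + 1]) : 0;
        const bool startsMarkup = std::isalpha(next) || next == '_' ||
                                  next == ':' || next == '/' ||
                                  next == '!' || next == '?' || next >= 0x80;
        out += startsMarkup ? "<" : "&lt;";
    }
    return out;
}
```

Applied to the sample from the question, this turns `<mo> < <mo>` into `<mo> &lt; <mo>`. The result is still not well-formed, because neither `<mo>` element is ever closed, so a conforming parser such as Xerces should still reject it.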

In short, you really need to push back on the provider of the invalid XML or you're going to find yourself in a much trickier position in the future.

Dancrumb