I am trying to parse an XML file with the "less than" and "greater than" symbols in the text.

Here is a sample XML file:

<document>
    <summary>
    The equation for t is: 567<T<600.
    </summary>
</document>

Is there any way to handle this in a Java XML parser? I know about escaping the characters as &lt; and &gt;, but I only want to escape them in the text, not in the tags.

Currently, I am trying to use DocumentBuilder, but parsing fails with a fatal error.

        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        domFactory.setExpandEntityReferences(false);
        try {
            DocumentBuilder builder = domFactory.newDocumentBuilder();
            // parse() throws SAXException on malformed input and IOException on read errors
            Document document = builder.parse(new InputSource(new StringReader(sectionXML.toString())));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            e.printStackTrace();
        }

The error I am getting is:

[Fatal Error] :1:70: Element type "T" must be followed by either attribute specifications, ">" or "/>".
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 70; Element type "T" must be followed by either attribute specifications, ">" or "/>".

Any thoughts? Thanks in advance for any help.

user1472409
  • Try putting this kind of text inside a CDATA section. – Arnaud May 12 '17 at 13:29
  • I thought about that, but how could I parse it to wrap only the text in the CDATA section? – user1472409 May 12 '17 at 13:30
  • See here : http://stackoverflow.com/questions/8489151/how-to-parse-xml-for-cdata – Arnaud May 12 '17 at 13:33
  • Ah, this seems to be parsing the XML for a CDATA section, which currently does not exist in my data. – user1472409 May 12 '17 at 13:34
  • So you want to parse an invalid XML? – Tamas Rev May 12 '17 at 13:36
  • Essentially, yes. – user1472409 May 12 '17 at 13:37
  • It's a catch 22. You cannot fix an invalid XML with XML parsers :( If you are sure that these things occur only between `` and `` then you can fix it with a regex. – Tamas Rev May 12 '17 at 13:44
  • That may work. I also read that I can use XSLT to wrap in CDATA. Which one do you recommend? – user1472409 May 12 '17 at 13:48
  • I don't think you can apply XSLT to an invalid XML. Just let me know if you could apply something :) I'd go with a regex or manual editing. – Tamas Rev May 12 '17 at 14:15
  • Just to make this clear: You cannot use an XML parser on that file as it is. Not DocumentBuilder, not SAXParser, not XMLInputFactory, not XSLT. You will have to write your own code to fix the text. I don’t expect it will be an easy endeavor. – VGR May 12 '17 at 15:06
  • It's not an XML file. It's a non-XML file. You need a non-XML parser. Better, get your data suppliers to use XML: it's a great standard when people actually trouble to use it. – Michael Kay May 12 '17 at 19:08
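
Following the regex suggestion in the comments, here is a minimal sketch of pre-processing the raw text before handing it to DocumentBuilder. It rests on assumptions that are not confirmed by the question: the stray characters only occur between <summary> and </summary>, the summary element has no attributes, and it contains no child elements (any real tags inside it would be escaped too). The class and method names (SummaryEscaper, escapeSummaryText) are made up for illustration. It only handles "<" and ">"; a bare "&" in the text would still need separate treatment.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SummaryEscaper {
    // Assumption: <summary> has no attributes and no nested elements.
    private static final Pattern SUMMARY =
            Pattern.compile("(<summary>)(.*?)(</summary>)", Pattern.DOTALL);

    /** Escapes '<' and '>' found between <summary> and </summary>; everything else is untouched. */
    static String escapeSummaryText(String rawXml) {
        Matcher m = SUMMARY.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(2).replace("<", "&lt;").replace(">", "&gt;");
            // quoteReplacement keeps '$' and '\' in the text from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + body + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String raw = "<document>\n    <summary>\n    The equation for t is: 567<T<600.\n"
                + "    </summary>\n</document>";
        String fixed = escapeSummaryText(raw);

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(fixed)));

        // The parser turns &lt; back into '<' in the text content.
        String summary = document.getElementsByTagName("summary").item(0).getTextContent().trim();
        System.out.println(summary); // prints: The equation for t is: 567<T<600.
    }
}
```

If the bad characters can appear in other elements, or the affected elements can contain real child tags, this approach stops being reliable and the data needs to be fixed at the source, as the later comments point out.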
