I'm parsing an XML file with a SAX parser in Java. My problem is that I have some rows like this:

<name xml:lang="en">Particulates, < 2.5 um</name>

I won't post all the code, but when the current tag is `name`, I set the name on my object:

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (isElementaryExchange && isName) {
            String name = new String(ch, start, length);
            this.currentElementaryFlowBase.setName(name);
        }
    }

The problem is that the result is name=" 2.5 um"; I think the "<" breaks something. Is there a way to parse that row correctly? Thanks


EDIT: Solved with a StringBuilder: append in the characters method and set the result only at the end of the element!
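The StringBuilder approach from the edit can be sketched roughly as follows. This is a minimal sketch, not the original project code: the class `NameHandler` and the helper `parseName` are hypothetical names, and the sample input assumes the `<` is stored in the file as `&lt;`. A SAX parser is allowed to deliver the text of one element in several `characters` calls, so the handler collects the chunks and reads the full value only in `endElement`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of <name> elements.
class NameHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inName = false;
    private String name;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("name".equals(qName)) {
            inName = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so append instead of assigning.
        if (inName) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            name = text.toString(); // the complete text is available only here
            inName = false;
        }
    }

    String getName() {
        return name;
    }

    // Hypothetical helper: parses an XML string and returns the last <name> text.
    static String parseName(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            NameHandler handler = new NameHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.getName();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }
}
```

With the assumed input `<name xml:lang="en">Particulates, &lt; 2.5 um</name>`, `NameHandler.parseName` returns the whole string `Particulates, < 2.5 um`, regardless of how the parser splits the text into chunks.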

Davide
  • I downvoted because [No research](http://idownvotedbecau.se/noresearch/) https://lmgtfy.app/?q=excaping+XML+special+cars+ – Timothy Truckle Jun 23 '21 at 16:46
  • Sorry, but let me explain: I cannot add escape characters to the XML files because I have more than 17 million files and I'm not authorized to modify them, so I need to solve the issue with the SAX parser (I cannot change the parser) – Davide Jun 24 '21 at 07:02
  • *"I cannot modify xml files"* **--** Your files are not [well formed](https://www.w3resource.com/xml/well-formed.php) and therefore **no properly implemented XML tool will process them**. – Timothy Truckle Jun 24 '21 at 17:02
  • Solved with a StringBuilder: append in the characters method and set the result only at the end of the element. I understand your point, but I'm not the boss, and if the boss asks me to solve a problem, I need to solve it. I asked whether there is a way, not what the best practice is, because I have 20 million of these XML files, and the choice is between failing a big project and trying to solve the issue. And the solution is quite simple, so why not? – Davide Jun 27 '21 at 15:35

1 Answer

The "less than" character < is not escaped, so the XML is not well-formed.
See Section 2.4 of the W3C XML specification:

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively.

Or, in terms of the specification's grammar (EBNF notation):

CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

So you have to escape the < (e.g. as &lt;) to get well-formed XML. Otherwise your input file is not XML, and any follow-up problems should be taken up with its creator.
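To see the difference in practice, here is a small sketch that checks whether a string is well-formed XML by running it through the JDK's SAX parser. The class `WellFormedCheck` and the method `isWellFormed` are hypothetical names chosen for this illustration. A conforming parser has to report a fatal error for a literal < inside character data, which surfaces here as a `SAXException`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for checking well-formedness.
class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler rethrows fatal errors, so a malformed document ends in SAXException.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            // I/O or parser configuration problems are not a verdict on the input.
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, `isWellFormed` returns `false` for `<name xml:lang="en">Particulates, < 2.5 um</name>` and `true` once the character is written as `&lt;`.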

zx485
  • Yes, but the problem is that I have more than 17 million XML files to parse, and I cannot modify these files, so I need to solve the issue with the parser. – Davide Jun 24 '21 at 07:00
  • So you mean that you have 17 million erroneous XML files? That's an interesting task. Writing a non-standard parser for this is...well...I don't envy you at all... I'm out. – zx485 Jun 24 '21 at 07:20
  • Solved with a StringBuilder: append in the characters method and set the result only at the end of the element. – Davide Jun 24 '21 at 07:56