3

We have a Java application that pulls data from an SAP system, parses it, and renders it to the users. The data is pulled using the SAP JCo connector.

Recently we got the following exception:

org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.

So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.

My questions are:

  1. Is there any existing (open source) utility that does this job of replacing illegal characters in XML?
  2. Or if I had to write such a utility, how should I handle them?
  3. Why is the above exception thrown?

Thank You.

Sandra Rossi
jai
  • So is the data coming from JCO as XML and you're parsing it? Or are you getting a name or something and writing it into an XML document that you're then parsing? – Tom Mar 18 '10 at 06:34
  • @Tom: JCO has Record.toXML() method that gives the data in XML format. – jai Mar 18 '10 at 06:55
  • Just out of curiosity - is there a special reason why you go through all the pain and CPU cycles of transforming the data into XML and then back again? – vwegert Mar 18 '10 at 19:05
  • @vwegert: Good question. Let me admit that we don't know the JCO API well enough to iterate over the JCO.Fields, and thought that toXML() might simplify our job. – jai Mar 19 '10 at 11:02
  • ...okay. I really don't know what to say. Sorry, but the JCo comes with API docs, example programs and a PDF tutorial. Instead of reading it and understanding how to use it, someone thought "let's just throw some XML into this". I honestly don't know whether to laugh or to cry... – vwegert Mar 20 '10 at 08:23

4 Answers

1

From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.

While replacing '&' with '&amp;' can be done with a simple String.replaceAll(...) on the string returned by the toXML() call, other characters can be harder to replace ('<' and '>', for example).
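One caveat with the ampersand replacement: a plain replaceAll("&", "&amp;") would also rewrite entities that are already escaped. The sketch below escapes only ampersands that do not start something shaped like an entity or character reference. The class and method names are hypothetical, and the regex is an approximation of XML entity names, not a full implementation of the spec.

```java
final class BareAmpersandEscaper {
    // An ampersand NOT followed by "name;", "#123;" or "#x1F;".
    // The name pattern is a simplification (ASCII only), assumed for illustration.
    private static final String BARE_AMP =
        "&(?!(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z_][A-Za-z0-9._-]*);)";

    // Escapes bare ampersands and leaves existing references untouched.
    static String escapeBareAmpersands(String xml) {
        return xml.replaceAll(BARE_AMP, "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("AT&T and &lt;tag&gt;"));
    }
}
```

This does nothing for a stray '<' or '>', which cannot be told apart from real markup with a regex alone.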

regards Guillaume

PATRY Guillaume
1

It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.

Alternatively, looking at the character code, &#00, you might be able to get away with a replace all on it with the empty string:

String goodXml = badXml.replaceAll("&#00;", "");
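If other invalid references can show up besides &#00;, the same idea can be generalized. The following is a minimal sketch, not part of JCo or any library: it removes every numeric character reference whose code point falls outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class and method names are hypothetical, and it assumes the references are terminated with a semicolon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InvalidCharRefStripper {
    // Decimal (&#0;) or hexadecimal (&#x0;) numeric character references.
    private static final Pattern CHAR_REF =
        Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // Code points allowed by the XML 1.0 Char production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops references to invalid code points; keeps everything else as-is.
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp;
            try {
                cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                cp = -1; // number too large to be a code point: treat as invalid
            }
            String replacement = isValidXmlChar(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidCharRefs("<a>x&#00;y&#65;</a>"));
    }
}
```

Running this over the string before handing it to the parser leaves valid references such as &#65; in place, so the rest of the document is unchanged.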
Tom
0

I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.

If I was in your situation, I'd either come up with a bespoke encoding, replacing the characters which are invalid in XML, and handling them as special cases in your parsing, or if possible, replace them with whitespace.

I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.
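For the whitespace option, a rough sketch in Java could look like the following. It handles raw characters in the text (not escaped references like &#00;), replacing every code point outside the XML 1.0 Char production with a space. The class and method names are made up for illustration; where to call it in a JCo-based application is left open.

```java
final class XmlCharSanitizer {
    // Code points allowed by the XML 1.0 Char production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces each invalid code point (including unpaired surrogates) with a space.
    static String replaceInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceInvalidChars("name\u0000value"));
    }
}
```

Iterating by code point rather than by char keeps valid surrogate pairs (characters above #xFFFF) intact.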

Stephen Denne
0

You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils. See:

http://commons.apache.org/lang/api-2.4/index.html

To read about how XML character references work, search for "numeric character references" on Wikipedia.

Hedley