
I am trying to send a POST request whose body contains XML. The receiving API demands that any special characters be encoded as numeric XML entities.

Let's take the example: İlkay Gündoğan

After XML-escaping with standard libraries like org.apache.commons.text.StringEscapeUtils, or with Jsoup using its XML parser, the output is only partially escaped:

İ and ğ are left untouched as literal characters instead of numeric entities. I already read the documentation of the mentioned libraries and found that only a certain range of characters is escaped.

  • Why do those libraries only convert specific ranges?
  • Is there any JVM library that supports escaping accented characters like İ and ğ?

I already tried sending a manually crafted example (`&#304;lkay G&#252;ndo&#287;an`) to the receiving API and it worked as expected.

All values are written and read in UTF-8.

lunatikz
    Double check your locale setting – Thorbjørn Ravn Andersen Apr 29 '21 at 12:23
  • If all values are written and read in UTF-8 then you don't need to escape *any* of those characters. If the receiving API demands that, then it doesn't accept valid XML. Numeric XML entities and the actual characters should be equivalent in a valid XML processor. Also "special characters" is **incredibly** ill-defined. **tl;dr** if you need this, you'll have to build it yourself, because that's not a standard requirement. – Joachim Sauer Apr 29 '21 at 12:24
  • @JoachimSauer not 100% sure if the recv. API is reading in UTF-8, but it's mentioned in their specs, as is the requirement for encoded XML entities. The problem would be solved for me if I had a lib that can escape those chars as numeric entities – lunatikz Apr 29 '21 at 12:27
  • @lunatikz: does it mention what it considers to be "special characters"? – Joachim Sauer Apr 29 '21 at 12:27
  • @JoachimSauer: It's not explicitly stated, but I think by "special chars" they mean any non-ASCII character. – lunatikz Apr 29 '21 at 12:28
  • @lunatikz: That would honestly be a bit silly. XML **specifically** has an encoding header to allow the use of any encoding, especially something like UTF-8 where escaping non-ASCII characters isn't needed. If they then go ahead and require them to be escaped, they ruin the basic idea of XML (i.e. they pay all the costs without getting any of the benefits). What you can try is to configure your XML library to explicitly use ASCII as the encoding, which should make it automatically escape all non-ASCII characters. – Joachim Sauer Apr 29 '21 at 12:33
  • @JoachimSauer: Yes, I'm aware of it, and I actually set the XML declaration with version and encoding. Not sure if the recv. side is handling XML 100% correctly. I will follow your suggestion and try this. Thanks so far. – lunatikz Apr 29 '21 at 12:38
  • @ThorbjørnRavnAndersen locale is set to utf-8. – lunatikz Apr 29 '21 at 12:39
  • Any standard XML parser should accept UTF-8 encoded XML. They might be working around an encoding problem on the way (by going into plain ASCII). In other words, this might be an X-Y problem. – Thorbjørn Ravn Andersen Apr 29 '21 at 12:49
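The suggestion in the comments above (declare US-ASCII as the output encoding and let the serializer do the escaping) can be sketched with the JDK's built-in `Transformer`. The class and element names here are my own invention:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AsciiXmlDemo {
    // Serialize a tiny DOM with US-ASCII as the declared output encoding;
    // characters not representable in that encoding are then written as
    // numeric character references by the serializer itself.
    static String render() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element player = doc.createElement("player"); // element name is made up
        player.setTextContent("İlkay Gündoğan");
        doc.appendChild(player);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // İ, ü and ğ come out as numeric references such as &#304;
        System.out.println(render());
    }
}
```

This keeps the escaping responsibility inside the XML library instead of post-processing the serialized string.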

1 Answer


If the XML encoding is UTF-8 (the default), then converting special characters to numeric entities is not needed; so you have a dubious receiver. escapeXml11 is indeed limited, as its javadocs say.

To translate all non-ASCII characters for a String xml:

xml = xml.codePoints()
    .mapToObj(cp -> cp < 128 ? Character.toString(cp) : String.format("&#%d;", cp))
    .collect(Collectors.joining());

You might even set `encoding="US-ASCII"` in the XML declaration.
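The snippet above can be wrapped into a self-contained sketch (Java 11+ for `Character.toString(int)`; the class and method names are mine):

```java
import java.util.stream.Collectors;

public class NumericEntityEscaper {
    // Escape every non-ASCII code point as a decimal numeric entity.
    // codePoints() iterates real Unicode code points, so characters
    // stored as surrogate pairs are handled as one unit.
    static String escapeNonAscii(String xml) {
        return xml.codePoints()
                .mapToObj(cp -> cp < 128 ? Character.toString(cp)
                                         : String.format("&#%d;", cp))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("İlkay Gündoğan"));
        // → &#304;lkay G&#252;ndo&#287;an
    }
}
```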

Joop Eggen
  • Yeah, I would also conclude that the receiving side is faulty. Your approach works well for me; the recv. side is now able to fully parse my request. I was wondering why most libs only support a limited range; this seems to apply to many of them. – lunatikz Apr 29 '21 at 14:16
  • In Java there is the problem that sometimes two `char`s form one Unicode code point. Hence using `Pattern.Matcher.replaceAll` with a lambda does not work for Asian scripts. Here, however, I can only guess that tables of encodings were used. Which is plain dumb. – Joop Eggen Apr 29 '21 at 17:19
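The surrogate-pair issue mentioned in the last comment can be seen with a character outside the Basic Multilingual Plane, e.g. U+1D11E (musical G clef), which Java stores as two `char`s:

```java
import java.util.stream.Collectors;

public class SurrogateDemo {
    // Same code-point-based escaping as in the answer: a surrogate pair
    // is treated as one character, not two broken halves.
    static String escape(String s) {
        return s.codePoints()
                .mapToObj(cp -> cp < 128 ? Character.toString(cp)
                                         : String.format("&#%d;", cp))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";                  // U+1D11E as a surrogate pair
        System.out.println(clef.length());             // 2 — two chars
        System.out.println(clef.codePoints().count()); // 1 — one code point
        System.out.println(escape(clef));              // → &#119070;
    }
}
```

A `char`-by-`char` escaper would emit two bogus entities for the two surrogates; `codePoints()` yields the single value 119070 (0x1D11E).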