I receive UTF-8 encoded XML from a third-party system and need to parse it and save it in my database. Below are four sample strings from that XML. When I pass them through StringEscapeUtils.unescapeXml, everything is unescaped correctly except the en dash.

String one = "<Name>test &apos; test</Name>";
String two = "<Fi>Em &#150; S</Fi>";
String three = "<FirstName>a1 &#228;</FirstName>";
String four = "crap&#201;";

System.out.println(StringEscapeUtils.unescapeXml(one));
System.out.println(StringEscapeUtils.unescapeXml(two));
System.out.println(StringEscapeUtils.unescapeXml(three));
System.out.println(StringEscapeUtils.unescapeXml(four));

Output:

<Name>test ' test</Name>

<Fi>Em  S</Fi>

<FirstName>a1 ä</FirstName>

crapÉ

Everything looks fine except the string "two", which should be "Em – S".

What am I doing wrong, and what is the best way to decode XML strings like these?

anuj

1 Answer

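Your console is probably just unable to display the character that &#150; refers to. Code point 150 (U+0096) is a C1 control character, not an en dash; the en dash is U+2013 (&#8211;). Byte 0x96 is an en dash only in Windows-1252, so the third-party system is most likely writing Windows-1252 byte values as numeric character references.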

But when you examine the unescaped string:

String two = "<Fi>Em &#150; S</Fi>";
String twoUnescaped = StringEscapeUtils.unescapeXml(two);
System.out.println(twoUnescaped.codePointAt(7));

you will find that the character reference is correctly unescaped to a Java character with code point 150.
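If the sender really meant the Windows-1252 character at byte 150 (an en dash, U+2013), one possible workaround is to remap the C1 range after unescaping. The following is only a sketch using the JDK alone: it assumes that any character in U+0080..U+009F appears in your data only because of that mistake, and the names Cp1252Repair and repairC1 are made up for illustration.

```java
import java.nio.charset.Charset;

public class Cp1252Repair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Replaces each C1 control character (U+0080..U+009F) with the character
     * that Windows-1252 assigns to the same byte value.
     * Assumption: such characters only occur because the sender used
     * Windows-1252 byte values as numeric character references.
     */
    public static String repairC1(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                // Decode the single byte with the same value as Windows-1252.
                char mapped = new String(new byte[] {(byte) c}, CP1252).charAt(0);
                // A few byte values have no Windows-1252 character and may
                // decode to U+FFFD; keep the original character in that case.
                out.append(mapped == '\uFFFD' ? c : mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same content as the unescaped string "two": U+0096 at index 7.
        String unescaped = "<Fi>Em \u0096 S</Fi>";
        String repaired = repairC1(unescaped);
        System.out.println(Integer.toHexString(repaired.codePointAt(7))); // prints 2013
    }
}
```

You would call repairC1 on the result of StringEscapeUtils.unescapeXml. If the sender can be persuaded to emit &#8211; instead, that is the cleaner fix.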

wero
  • Thanks for the answer. Is there a utility that unescapes only the numeric entities (the ones starting with &#) and leaves regular XML entities such as &apos; untouched? – anuj Feb 25 '16 at 15:43