-1

I have a problem with an XML document that contains special characters (the problematic string is löööschee`*‘‘§a). The XML arrives as a XOM object in Java. While investigating the problem, I tried to print the text of the XML with a serializer, and I noticed that streaming directly to System.out was the only way to get the correct string.

Here is the code I used for printing out the xml:

Element pEntry; //this is the XOM object I get, it contains the xml
Document document = pEntry.getDocument();
ByteArrayOutputStream stream = new ByteArrayOutputStream();
Serializer serializer = new Serializer(stream);
Serializer serializer2 = new Serializer(System.out);
try {
    serializer.write(document);
    serializer2.write(document);
} catch (IOException e) {
    System.out.println(e.getMessage());
}
System.out.println("#####################################################################");
System.out.println(stream);

So serializer2 writes directly to System.out, and there the string appears as it should. The System.out.println call, however, prints the string as l??????schee`*????????a. I tried many different encodings (the default encoding for the serializer is "UTF-8", which seems correct), but the only approach I found that prints the correct string is streaming directly to System.out.
I also printed the bytes of the first stream (the one that does not work), and this was the output:
6c ffffffc3 ffffffb6 ffffffc3 ffffffb6 ffffffc3 ffffffb6 73 63 68 65 65 60 2a ffffffe2 ffffff80 ffffff98 ffffffe2 ffffff80 ffffff98 ffffffc2 ffffffa7 61.
I don't really know whether this is correct, and I can't print the bytes that are streamed directly to System.out. I saw that c3 b6, for example, should be an ö, which would be correct, but I don't know what the ffffff prefixes mean.
Why are the two outputs different even though they use the same encoding?

Other things I tried:

  • adding -J-Dfile.encoding=UTF-8 to the javac command -> didn't make a difference
  • initializing the serializers with different encodings (UTF-8, UTF-16, US-ASCII) -> the only one that worked correctly for serializer2 was UTF-8, so I assume this is the correct encoding
  • Instead of System.out.println(stream) putting
    String xmlContent = stream.toString(StandardCharsets.UTF_8);
    System.out.println(xmlContent);
    
    -> this was at least an improvement I think, the string then looked like l???schee`*?????a
  • The console may be badly configured, so it misinterprets the bytes it receives. Change the console encoding to the correct value. – Jean-Baptiste Yunès Aug 30 '23 at 08:04
  • "I tried many different things with different encodings" please show these things. – Andy Turner Aug 30 '23 at 08:12
  • Why do you care? The String encoding of binary data isn't useful, and neither is streaming serializations to `System.out`. Print out the actual bytes. You will find they are exactly the same and that you have nothing to worry about. – user207421 Aug 30 '23 at 08:26
  • The problem is actually that I am sending this xml to another service which is saying invalid UTF-8. I thought if I can just get the output of the correct stream to System.out somehow it should work. And the System.out only shows the correct string if I use UTF-8, so I think it is the correct encoding. – Michael Lamprecht Aug 30 '23 at 08:51
  • What `Serializer` class are you using? And how are you sending the data to the server? If that involves `String` or `Writer`, don't. Send the bytes. – user207421 Aug 30 '23 at 08:59
  • I am using the nu.xom.Serializer. I could solve the console problem by setting the encoding of the console to UTF-8. I still have to figure out why my other service is throwing an error, thought that would be related, but this question is solved. – Michael Lamprecht Aug 30 '23 at 09:03
  • Ah sending the bytes is a good call, I will try that thanks. – Michael Lamprecht Aug 30 '23 at 09:04
  • Yeah, sending the bytes and not a string also solved my other problem, thank you! – Michael Lamprecht Aug 30 '23 at 09:06
  • @MichaelLamprecht FYI, I edited your question to remove the solution to your problem, which you have posted as an answer. Please don't edit your question to do that. It is simply not how things are done here, and it is confusing to readers. – skomisa Sep 01 '23 at 05:43
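
The advice in the comments above (send the bytes, not a String) can be illustrated with a small sketch. It assumes the payload is UTF-8, as in the question. The class and method names are made up for illustration, and US-ASCII merely stands in for whatever non-UTF-8 default charset a JVM might have.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BytesVersusString {

    // Hypothetical helper: simulates the detour bytes -> String -> bytes,
    // where the String is built with a charset that may not match the data.
    static byte[] roundTripThroughString(byte[] utf8Bytes, Charset decodeCharset) {
        String decoded = new String(utf8Bytes, decodeCharset);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "löööschee" encoded as UTF-8, as the serializer would produce it.
        byte[] original = "l\u00f6\u00f6\u00f6schee".getBytes(StandardCharsets.UTF_8);

        byte[] viaAscii = roundTripThroughString(original, StandardCharsets.US_ASCII);
        byte[] viaUtf8 = roundTripThroughString(original, StandardCharsets.UTF_8);

        // Decoding with the wrong charset replaces every non-ASCII byte,
        // so the bytes that come back no longer match what went in.
        System.out.println("intact after US-ASCII detour: " + Arrays.equals(original, viaAscii));
        System.out.println("intact after UTF-8 detour:    " + Arrays.equals(original, viaUtf8));
    }
}
```

Passing stream.toByteArray() straight to the receiving service avoids the detour entirely, so no charset choice can corrupt the data.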

2 Answers

0

Putting the line System.setOut(new PrintStream(System.out, true, StandardCharsets.UTF_8)); above the console output solved the problem, now the console is always showing the correct string.
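
A minimal sketch of that fix, assuming Java 10 or later (for the PrintStream constructor that accepts a Charset) and a console that itself decodes UTF-8. The encode helper is hypothetical and only demonstrates which bytes a PrintStream emits under a given charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8Console {

    // Hypothetical helper: returns the bytes a PrintStream writes for `text`
    // when it is configured with `charset`.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charset)) {
            out.print(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // US-ASCII cannot represent 'ö', so the stream substitutes a single '?'.
        System.out.println(encode("\u00f6", StandardCharsets.US_ASCII).length);
        // UTF-8 encodes 'ö' as the two bytes c3 b6.
        System.out.println(encode("\u00f6", StandardCharsets.UTF_8).length);

        // Replace System.out so every later println encodes its text as UTF-8.
        System.setOut(new PrintStream(System.out, true, StandardCharsets.UTF_8));
        System.out.println("l\u00f6\u00f6\u00f6schee");
    }
}
```

The replacement only helps if the terminal also interprets the incoming bytes as UTF-8; otherwise the mismatch simply moves from the JVM to the console.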

-1

You probably get the right output with serializer2 because it writes its bytes straight to the console, which most likely uses UTF-8 (often the system default) and can therefore display the special characters correctly.

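With serializer you are writing into a ByteArrayOutputStream, which only stores raw bytes and knows nothing about character encodings. System.out.println(stream) calls stream.toString(), which decodes those bytes with the platform default charset, and that is not necessarily UTF-8. Try providing the encoding explicitly when converting the ByteArrayOutputStream to a string, for example new String(stream.toByteArray(), StandardCharsets.UTF_8).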
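
A sketch of the explicit decoding, using a hand-built buffer instead of the XOM serializer so that it runs on its own; the class and helper names are hypothetical. It also shows where the ffffff prefixes in the question's byte dump most likely come from: Java bytes are signed, so a byte such as 0xc3 widens to a negative int unless it is masked with 0xff before formatting.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class DecodeBuffer {

    // Formats each byte as two hex digits; the 0xff mask undoes sign extension.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the serializer output: "lösch" encoded as UTF-8.
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        byte[] utf8 = "l\u00f6sch".getBytes(StandardCharsets.UTF_8);
        stream.write(utf8, 0, utf8.length);

        // Explicit charset: the result does not depend on the platform default.
        String text = new String(stream.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(text.equals("l\u00f6sch")); // true

        // Without the mask the negative byte is sign-extended to 32 bits.
        System.out.println(Integer.toHexString((byte) 0xc3));        // ffffffc3
        System.out.println(Integer.toHexString((byte) 0xc3 & 0xff)); // c3

        System.out.println(toHex(stream.toByteArray())); // 6c c3 b6 73 63 68
    }
}
```

Read with the ffffff prefixes stripped, the dump in the question (6c c3 b6 ... e2 80 98 ... c2 a7 61) is valid UTF-8 for the original string, so the buffer itself holds correct data.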

Pradipta Sarma