The problem came up when reading the result of a web service that returns JSON containing Greek characters, specifically the name of the city of Mykonos. Whatever encoding or conversion I use, it is always displayed as: ΜΎΚΟxCE?ΟΣ. It should show: ΜΎΚΟΝΟΣ

Using PowerShell I was able to verify that the web service returns the correct characters.

I narrowed the problem down to the point where the byte array gets converted to a String in Groovy. The code below reproduces the issue. myUTF8String holds, as hex, the bytes I get from URLConnection.content.text. The UTF-8 byte sequence to look at is 0xce, 0x9d. After converting the bytes to a String and back to a byte array, the sequence for that character is 0xce, 0x3f. The code reports the difference at position 9 between the original byte array and the one produced from the converted String. I am running this test in Groovy Console 4.0.6.

Any hints on this one?

import java.nio.charset.StandardCharsets;

def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"
def bytes = myUTF8String.decodeHex();

content =  new String(bytes).getBytes()
for ( i = 0; i < content.length; i++ ) {
    if ( bytes[i] != content[i] ) {
        println "Different... at pos " + i
        hex =  Long.toUnsignedString( bytes[i], 16).toUpperCase()
        print hex.substring(hex.length()-2,hex.length()) + " != "
        hex =  Long.toUnsignedString( content[i], 16).toUpperCase()
        println hex.substring(hex.length()-2,hex.length())
    }
}

Thanks a lot

Andreas

  • When you run the script at https://groovyconsole.appspot.com/edit/4831238661603328, are you seeing different behavior? – Jeff Scott Brown Nov 02 '22 at 18:29
  • `new String(bytes).getBytes()` almost certainly doesn't do what you want it to do, and even if you fix it to use UTF-8 by using `new String(bytes, "UTF-8").getBytes("UTF-8")`, all it does is change the bytes *if* the input isn't valid UTF-8. – Joachim Sauer Nov 03 '22 at 08:21

1 Answer

You have to specify the charset name when building a String from bytes. Otherwise the default Java charset is used, and that is not necessarily UTF-8.

Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
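To see which charset your JVM actually falls back to, something like the following sketch can help. The class and method names here are made up for illustration. Note that on JDK 18 and later the default is UTF-8 unless it is overridden, while older JDKs usually derive it from the operating system locale.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, only for illustration.
class DefaultCharsetCheck {
    // Name of the charset that new String(bytes) and String.getBytes()
    // use when no charset argument is given.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // System property that is commonly used to override the default at startup.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

If this does not print UTF-8 on your machine, any conversion without an explicit charset is a candidate for the problem described in the question.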

The same problem applies to String.getBytes(): pass a charset parameter to get the correct byte sequence.

Just change the following line in your code and the issue will disappear:

content =  new String(bytes, "UTF-8").getBytes("UTF-8")
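A minimal Java sketch of why the explicit charset matters, using the bytes from the question. The assumption here (not confirmed in the question) is that the default charset on the asker's machine was windows-1252, which has no character assigned to byte 0x9d. That would explain why exactly that byte comes back as 0x3f ('?'). Class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, only for illustration.
class RoundTripDemo {
    // The UTF-8 bytes from the question: ce9c ce8e ce9a ce9f ce9d ce9f cea3
    static final byte[] ORIGINAL = {
        (byte) 0xce, (byte) 0x9c, (byte) 0xce, (byte) 0x8e, (byte) 0xce, (byte) 0x9a,
        (byte) 0xce, (byte) 0x9f, (byte) 0xce, (byte) 0x9d, (byte) 0xce, (byte) 0x9f,
        (byte) 0xce, (byte) 0xa3
    };

    // Decode and re-encode the bytes with the same, explicitly named charset.
    static byte[] roundTrip(byte[] input, Charset cs) {
        return new String(input, cs).getBytes(cs);
    }

    // Index of the first differing byte, or -1 if both arrays are equal.
    static int firstDifference(byte[] a, byte[] b) {
        return Arrays.mismatch(a, b);
    }

    public static void main(String[] args) {
        byte[] viaUtf8 = roundTrip(ORIGINAL, StandardCharsets.UTF_8);
        // Assumption: windows-1252 stands in for the asker's default charset.
        byte[] viaCp1252 = roundTrip(ORIGINAL, Charset.forName("windows-1252"));

        System.out.println("UTF-8 first difference:        " + firstDifference(ORIGINAL, viaUtf8));
        System.out.println("windows-1252 first difference: " + firstDifference(ORIGINAL, viaCp1252));
    }
}
```

With UTF-8 the round trip leaves the bytes untouched. With a single-byte charset that lacks a mapping for one of the input bytes, that byte is replaced during decoding and cannot be recovered afterwards.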

As an option, you can set the default charset for the whole JVM instance with the following command line parameter:

java -Dfile.encoding=UTF-8 <your application>

But be careful, because it will affect the whole JVM instance!

https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F
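If changing the JVM-wide default is too broad, a narrower option is to decode the response bytes with an explicit charset at the point where they are read. The sketch below uses an in-memory stream as a stand-in for the web service response, and the class and method names are hypothetical. In Groovy, the GDK's getText("UTF-8") on a URL or InputStream should serve the same purpose.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
class ExplicitDecode {
    // Reads the whole stream and decodes it as UTF-8,
    // independent of the JVM default charset.
    static String readUtf8(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the response body: the city name from the question, UTF-8 encoded.
        byte[] body = "\u039c\u038e\u039a\u039f\u039d\u039f\u03a3".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(body)) {
            String text = readUtf8(in);
            System.out.println("Decoded " + text.length() + " characters");
        }
    }
}
```

This keeps the charset decision next to the I/O code, so other parts of the application that rely on the platform default are not affected.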

daggett