
I'm trying to sort out characters, their representation as byte sequences in the various character sets, and how to convert from one character set to another in Java. I'm having some difficulty.

For instance,

ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());

My understanding is that:

  • Strings are always stored as a UTF-16 byte sequence in Java (2 bytes per character, big endian)
  • getBytes() result is this same UTF-16 byte sequence
  • wrap() maintains this sequence
  • bybf is therefore a UTF-16 big endian representation of the string Olé

Thus in this code:

Charset utf16 = Charset.forName("UTF-16");  
CharBuffer chbf = utf16.decode(bybf);  
System.out.println(chbf);  

decode() should

  • Interpret bybf as a UTF-16 string representation
  • "convert" it to the original string Olé.

In fact, no byte should be altered, since everything is stored as UTF-16 and the UTF-16 Charset should act as a kind of "neutral operator". However, the result is printed as:

??

How can that be?

Additional question: To convert correctly, it seems Charset.decode(ByteBuffer bb) requires bb to be a UTF-16 big endian byte sequence image of a string. Is that correct?


Edit: Following the answers provided, I did some testing to print the content of a ByteBuffer and the chars obtained by decoding it. In each group, the first line shows the bytes [encoded with "Olé".getBytes(charsetName)], and the following line(s) show the strings obtained by decoding those bytes back [with Charset#decode(ByteBuffer)] using various Charsets.

I also confirmed that the default encoding for storing String into byte[] on a Windows 7 computer is windows-1252 (unless strings contain chars requiring UTF-8).

Default VM encoding: windows-1252  
Sample string: "Olé"  


  getBytes() no CS provided : 79 108 233  <-- default (windows-1252), 1 byte per char
     Decoded as windows-1252: Olé         <-- using the same CS as getBytes()
           Decoded as UTF-16: ??          <-- using another CS (indeed doesn't work)

  getBytes with windows-1252: 79 108 233  <-- same as getBytes()
     Decoded as windows-1252: Olé

         getBytes with UTF-8: 79 108 195 169  <-- 'é' in UTF-8 uses 2 bytes
            Decoded as UTF-8: Olé

        getBytes with UTF-16: 254 255 0 79 0 108 0 233 <-- each char uses 2 bytes with UTF-16
           Decoded as UTF-16: Olé                          (254 255 is the byte order mark)
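
A byte dump like the one above can be reproduced with a small sketch along these lines. The class name ByteDump and the helper dump() are hypothetical, and ISO-8859-1 stands in for windows-1252 here (an assumption made only because ISO-8859-1 is guaranteed to be available; both encode 'é' as 233).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteDump {
    /** Formats bytes as unsigned decimal values separated by spaces. */
    static String dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(b & 0xFF); // mask so that 0xE9 prints as 233 rather than -23
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Olé", written with an escape so the source file encoding does not matter.
        String sample = "Ol\u00e9";
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1, // stand-in for windows-1252 (assumption)
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + dump(sample.getBytes(cs)));
        }
    }
}
```

Passing the charset explicitly to getBytes() is what makes each line of the dump independent of the machine's default encoding.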
mins

3 Answers


You are mostly correct.

The native character representation in Java is UTF-16. However, when converting characters to bytes, you either specify the charset you are using or the system uses its default, which has been UTF-8 whenever I checked on my systems. This will yield interesting results if you mix and match.

For example, on my system the following

System.out.println(Charset.defaultCharset().name());
ByteBuffer bybf = ByteBuffer.wrap("Olé".getBytes());
Charset utf16 = Charset.forName("UTF-16");
CharBuffer chbf = utf16.decode(bybf);
System.out.println(chbf);
bybf = ByteBuffer.wrap("Olé".getBytes(utf16));
chbf = utf16.decode(bybf);
System.out.println(chbf);

produces

UTF-8
佬쎩
Olé

So this part is only correct if UTF-16 is the default charset:
getBytes() result is this same UTF-16 byte sequence.

So either always specify the charset you are using, which is safest because you will always know what is going on, or always use the default.
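
As a minimal sketch of the "always specify the charset" option: the same Charset is passed to both the encoding and the decoding step, so the platform default never comes into play. The class name ExplicitCharset and the method roundTrip() are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharset {
    /** Encodes text with the given charset, then decodes the bytes with that same charset. */
    static String roundTrip(String text, Charset cs) {
        ByteBuffer bytes = ByteBuffer.wrap(text.getBytes(cs)); // chars -> bytes, explicit charset
        return cs.decode(bytes).toString();                    // bytes -> chars, same charset
    }

    public static void main(String[] args) {
        // "Olé", written with an escape so the source file encoding does not matter.
        String sample = "Ol\u00e9";
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));
        System.out.println(roundTrip(sample, StandardCharsets.UTF_16).equals(sample));
    }
}
```

The round trip succeeds for any charset that can represent every character of the text; a charset that cannot (US-ASCII for 'é', say) would substitute a replacement byte instead.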

BevynQ
  • most Windows systems do _not_ default to UTF-8. Also, not sure what you mean by "UTF-16 ish". Java uses UTF-16. – jtahlborn Jun 30 '14 at 02:06
  • Thanks BevynQ. I'm currently learning Java, your demonstration has been very useful to me. – mins Jun 30 '14 at 06:53
  • @jtahlborn: my default CS was windows-1252 until I changed the sample string to "I♥café". Adding the heart made Java switch to UTF-8. Very educative. – mins Jun 30 '14 at 06:54

Strings are always stored as a UTF-16 byte sequence in Java (2 bytes per character, big endian)

Yes.

getBytes() result is this same UTF-16 byte sequence

No. It encodes the UTF-16 chars into the platform default charset, whatever that is.

wrap() maintains this sequence

wrap() maintains everything.

bybf is therefore a UTF-16 big endian representation of the string Olé

No. It wraps the platform's default encoding of the original string.

decode() should

  • Interpret bybf as a UTF-16 string representation

No, see above.

  • "convert" it to the original string Olé.

Not unless the platform's default encoding is "UTF-16".
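
To make the mismatch concrete, here is a small sketch that decodes single-byte-encoded data as UTF-16. The names MismatchDemo and decodeAsUtf16() are hypothetical, and ISO-8859-1 is used as an assumed stand-in for a single-byte platform default. The first two bytes (4F 6C) are read as one big-endian code unit, U+4F6C, and the leftover byte cannot form a code unit on its own; Charset.decode() is documented to substitute the charset's replacement string for such malformed input, which a console may then show as "?".

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class MismatchDemo {
    /** Decodes raw bytes as UTF-16, whatever charset actually produced them. */
    static String decodeAsUtf16(byte[] bytes) {
        return StandardCharsets.UTF_16.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // "Olé", written with an escape so the source file encoding does not matter.
        String sample = "Ol\u00e9";

        // Three bytes (4F 6C E9): one per char, as a single-byte charset produces.
        byte[] singleByte = sample.getBytes(StandardCharsets.ISO_8859_1);
        String wrong = decodeAsUtf16(singleByte);
        for (char c : wrong.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // code units that were actually decoded
        }

        // Bytes produced by the UTF-16 encoder decode back to the original string.
        byte[] utf16 = sample.getBytes(StandardCharsets.UTF_16);
        System.out.println(decodeAsUtf16(utf16).equals(sample));
    }
}
```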

user207421
  • Thanks for the very detailed answer. I would have selected it as a correct one too if it was possible to select multiple answers. [getBytes()](http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes--) is still not deprecated, though it is discouraged. – mins Jun 30 '14 at 07:05
  • @EJP The only #getBytes() that is deprecated is [`public void getBytes(int srcBegin, int srcEnd, byte[] dst, int dstBegin)`](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-int-int-byte:A-int-), all other overloaded versions of this method (including the one without any arguments) aren't deprecated. – klaar Apr 05 '16 at 13:52

I had nearly the same problem with data encoded in a double-byte charset. Answer 3 above already covers the critical pitfalls you should keep an eye on.

  1. Define a Charset for the source encoding.
  2. Define a Charset for the target encoding only if it is different from your local system encoding.

The following code works:

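import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static String convertUTF16ToString(byte[] doc)
{
    final Charset doublebyte = StandardCharsets.UTF_16;
    // Don't need this because it is my local (system default).
    //final Charset ansiCharset = StandardCharsets.ISO_8859_1;

    // decode() turns the UTF-16 bytes into chars; CharBuffer.toString()
    // returns them as a String, so no intermediate StringBuffer is needed.
    final CharBuffer decoded = doublebyte.decode(ByteBuffer.wrap(doc));
    return decoded.toString();
}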

To target a specific encoding instead of the system default, replace it with your favorite encoding:

public static String convertUTF16ToUTF8(byte[] doc)
{
    final Charset doublebyte = StandardCharsets.UTF_16;
    final Charset ansiCharset = StandardCharsets.ISO_8859_1;
    // alternative target: StandardCharsets.UTF_8

    final CharBuffer decoded = doublebyte.decode(ByteBuffer.wrap(doc));

    // The buffer returned by encode() may have a backing array larger than
    // the encoded data, so copy only the remaining bytes instead of array().
    final ByteBuffer encoded = ansiCharset.encode(decoded);
    final byte[] result = new byte[encoded.remaining()];
    encoded.get(result);

    // Decode with the charset used to encode, not the platform default.
    return new String(result, ansiCharset);
}
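
When the goal is a byte sequence in the target encoding rather than a String, the two steps listed earlier can be sketched as a small transcoding helper. The names Transcoder and transcode() are hypothetical, and this is only one possible way to do it, under the assumption that the whole input fits in memory.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Transcoder {
    /** Re-encodes bytes from the source charset to the target charset via a String. */
    static byte[] transcode(byte[] input, Charset source, Charset target) {
        String text = new String(input, source); // step 1: bytes -> chars, source charset
        return text.getBytes(target);            // step 2: chars -> bytes, target charset
    }

    public static void main(String[] args) {
        // "Olé" as UTF-16 bytes; the escape keeps the source file encoding irrelevant.
        byte[] utf16 = "Ol\u00e9".getBytes(StandardCharsets.UTF_16);
        byte[] utf8 = transcode(utf16, StandardCharsets.UTF_16, StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8)); // signed byte values
    }
}
```

Both charsets are named explicitly, so the result does not depend on the local system encoding.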
Wolf