Java bug? Why extra zero byte in utf8 encoding?

Question

The following code

import java.nio.charset.Charset;

public class CharsetProblem {
public static void main(String[] args) {
    //String str = "aaaaaaaaa";
    String str = "aaaaaaaaaa";
    Charset cs1 = Charset.forName("ASCII");
    Charset cs2 = Charset.forName("utf8");

    System.out.println(toHex(cs1.encode(str).array()));
    System.out.println(toHex(cs2.encode(str).array()));

}

public static String toHex(byte[] outputBytes) {

    StringBuilder builder = new StringBuilder();

    for(int i=0; i<outputBytes.length; ++i) {
        builder.append(String.format("%02x", outputBytes[i]));
    }

    return builder.toString();
}
}

returns

61616161616161616161
6161616161616161616100

That is, the UTF-8 encoding appears to return an extra byte. With fewer a's there are no extra bytes; with more a's there can be more and more extra bytes.

Why?

How can one work around this?

Greg Kopff · Accepted Answer · 2012-07-03T22:00:20.377

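You can't just take the backing array and use it as-is. A ByteBuffer has a capacity, a position and a limit, and only the bytes between the position and the limit are content; the backing array can be larger than that.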

System.out.println(cs1.encode(str).remaining());
System.out.println(cs2.encode(str).remaining());

produces:

10
10
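To see where the extra byte comes from, it may help to print the buffer's indices next to the length of its backing array. This is only a sketch: the class and method names are made up for illustration, and it assumes the buffers returned by encode() are array-backed, which is the case on common JDKs but not something the API promises. The capacity you see depends on the JDK's internal allocation, so no particular value is claimed here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferIndices {
    // Describes the buffer's indices alongside the length of its backing array.
    // Assumes buf.hasArray() is true; array() throws otherwise.
    static String describe(ByteBuffer buf) {
        return "position=" + buf.position()
                + " limit=" + buf.limit()
                + " capacity=" + buf.capacity()
                + " array.length=" + buf.array().length;
    }

    public static void main(String[] args) {
        String str = "aaaaaaaaaa";
        System.out.println(describe(StandardCharsets.US_ASCII.encode(str)));
        System.out.println(describe(StandardCharsets.UTF_8.encode(str)));
    }
}
```

If array.length comes out larger than limit for the UTF-8 buffer, the bytes past the limit are unused space, not encoded output.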

Try this instead:

public static void main(String[] args) {
  //String str = "aaaaaaaaa";
  String str = "aaaaaaaaaa";
  Charset cs1 = Charset.forName("ASCII");
  Charset cs2 = Charset.forName("utf8");

  System.out.println(toHex(cs1.encode(str)));
  System.out.println(toHex(cs2.encode(str)));
}

public static String toHex(ByteBuffer buff) {
  StringBuilder builder = new StringBuilder();
  while (buff.remaining() > 0) {
    builder.append(String.format("%02x", buff.get()));
  }
  return builder.toString();
}

It produces the expected:

61616161616161616161
61616161616161616161

score 7 · Answer 2 · answered Jul 03 '12 at 21:37

You're assuming that the backing array of a ByteBuffer is exactly the right size to hold the contents, but that is not necessarily so. In fact, the contents don't even need to start at the first byte of the array! Study the API for ByteBuffer and you'll understand what's going on: within the backing array, the readable contents start at index arrayOffset() + position() and end just before index arrayOffset() + limit().
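A minimal sketch of that index arithmetic, in case it helps. The helper name readableBytes is hypothetical, and the example builds a buffer with a non-zero arrayOffset() by slicing a wrapped sub-range rather than by encoding a string:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BackingArraySlice {
    // Copies only the readable bytes (position..limit) out of an array-backed buffer.
    static byte[] readableBytes(ByteBuffer buf) {
        if (!buf.hasArray()) {
            throw new IllegalArgumentException("buffer has no accessible backing array");
        }
        int start = buf.arrayOffset() + buf.position();
        int end = buf.arrayOffset() + buf.limit();
        return Arrays.copyOfRange(buf.array(), start, end);
    }

    public static void main(String[] args) {
        byte[] raw = {0, 0, 0x61, 0x61, 0x61, 0, 0};
        // Slicing a wrapped sub-range gives a buffer whose arrayOffset() is 2.
        ByteBuffer view = ByteBuffer.wrap(raw, 2, 3).slice();
        System.out.println(view.arrayOffset());                   // 2
        System.out.println(Arrays.toString(readableBytes(view))); // [97, 97, 97]
    }
}
```

Here view.array() is still the full seven-byte array, so reading it from index 0 would pick up bytes that are not part of the buffer's contents.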

score 2 · Answer 3 · answered Jan 20 '14 at 10:55

The answer has already been given, but as I ran into the same problem, I think it might be useful to provide more details:

Invoking cs1.encode(str).array() or cs2.encode(str).array() returns a reference to the whole array allocated to the ByteBuffer at that time. That array may be larger than the portion actually used. To retrieve only the used portion, you can do something like the following (this assumes the content starts at index 0 of the array):

ByteBuffer bf1 = cs1.encode(str);
ByteBuffer bf2 = cs2.encode(str);
System.out.println(toHex(Arrays.copyOf(bf1.array(), bf1.limit())));
System.out.println(toHex(Arrays.copyOf(bf2.array(), bf2.limit())));

This yields the result you expect.
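If you would rather not touch the backing array at all, one option is to copy the remaining bytes out through the buffer's own get method, or to skip the ByteBuffer entirely with String.getBytes. A small sketch, where the class and helper names are just for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EncodedBytes {
    // Copies the bytes between position and limit into a new array.
    // Works on a duplicate so the caller's position is left unchanged.
    static byte[] toBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out);
        return out;
    }

    public static void main(String[] args) {
        String str = "aaaaaaaaaa";
        byte[] viaBuffer = toBytes(StandardCharsets.UTF_8.encode(str));
        byte[] direct = str.getBytes(StandardCharsets.UTF_8);
        System.out.println(viaBuffer.length); // 10
        System.out.println(direct.length);    // 10
    }
}
```

Neither variant depends on the buffer being array-backed or on where its content sits inside the array.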
