How to convert CharSequence to UTF-8 encoded byte array in Java?

Question

I am trying to convert a CharSequence to a UTF-8 encoded byte[] array.

I've been having problems with it, so I was going to ask Stack Overflow for help, and I was going to write a Java Fiddle to demonstrate it:

https://www.mycompiler.io/view/3MliN0HgwDD

Except the fiddle itself doesn't work:

import java.util.*;
import java.lang.*;
import java.io.*;
import java.nio.*;
import java.nio.charset;

// The main method must be in a class named "Main".
class Main {
    public static byte[] charSequenceToUtf8(final CharSequence input)
    {
        //char[] chars = new char[input.length];
        //for (int i=0; i<input.length; i++)
        //  chars[i] = input.charAt(i);

        CharBuffer charBuffer = CharBuffer.wrap(input);
        checkEquals(10, charBuffer.length(), "Charbuffer is wrong length");

        Charset cs = Charset.forName("UTF-8"); 
        ByteBuffer byteBuffer = cs.encode(charBuffer);
        checkEquals(10, byteBuffer.length(), "byteBuffer is wrong length");
        
        byte[] utf8 = byteBuffer.array();        
        checkEquals(10, utf8.length, "utf8 bytes is wrong length");
    }
    
    public static void checkEquals(int expected, int actual, String message)
    {
        if (expected == actual)
            return;
            
        String sExpected = String.valueOf(expected);
        String sActual = String.valueOf(actual);
        
        throw new Exception("Test failed. Expected "+sExpected+", Actual "+sActual+". "+message);
    }
    
    public static void main(String[] args) {
        test("AAAAAAAAAA"); //ten A's
    }
}

It seems that java.nio requires at least Java 7 (ref), which is why it is confusing to me that it doesn't work in Java 16.

So this brings up a lot of questions:

How can I convert a CharSequence to a byte[] array?
Why does it not work in Java 16?

In the end, the actual bug is that trying to encode the string AAAAAAAAAA (ten A's) returns an 11-element array:

CharSequence	UTF-8 byte array
"AA"	`[65, 65]`
"AAA"	`[65, 65, 65]`
"AAAA"	`[65, 65, 65, 65]`
"AAAAA"	`[65, 65, 65, 65, 65]`
"AAAAAA"	`[65, 65, 65, 65, 65, 65]`
"AAAAAAA"	`[65, 65, 65, 65, 65, 65, 65]`
"AAAAAAAA"	`[65, 65, 65, 65, 65, 65, 65, 65]`
"AAAAAAAAA"	`[65, 65, 65, 65, 65, 65, 65, 65, 65]`
"AAAAAAAAAA"	`[65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 0]`

Why is the above code, which I stole from the linked question, failing for a string of 10 characters?

Looks like it's working to me. Don't know _why_, but it appears encoding `"AAAAAAAAAA"` returns a `ByteBuffer` whose **capacity** is `11`, but its **limit** is set to `10`. You're printing out the entire backing array without taking the limit into account. — Slaw, Jul 08 '22 at 21:24
Why are you doing it this way? What's wrong with just doing `.getBytes(StandardCharsets.UTF_8)`? — Sweeper, Jul 08 '22 at 21:32
In addition to @DuncG comment, there is no reason to have `import java.lang.*;`. In Java, that import is implicitly present in all source files. — Slaw, Jul 08 '22 at 21:44
@Sweeper [**CharSequence**](https://docs.oracle.com/javase/7/docs/api/java/lang/CharSequence.html) doesn't have a `.getBytes()` method — Ian Boyd, Jul 09 '22 at 00:19
@Slaw For that you can blame `mycompiler.io` (not that it matters at all). — Ian Boyd, Jul 09 '22 at 00:24

Slaw · Answer 1 · 2022-07-08T21:51:24.037

First, note that if you have a String, then you can simply do:

byte[] bytes = theString.getBytes(StandardCharsets.UTF_8);

Or, even if you have a CharSequence, you can do:

byte[] bytes = theCharSequence.toString().getBytes(StandardCharsets.UTF_8);

That will potentially create a String copy of the CharSequence if it's not already a String, though it should be quickly garbage collected.

But regarding your question, you're not taking the ByteBuffer's limit (or position, though it's 0 in this case) into account. For whatever reason, encoding "AAAAAAAAAA" results in a buffer whose capacity is 11, but whose limit is 10. But the #array() method returns the entire backing array, regardless of the buffer's position or limit. This means you need to manually take the limit (and position) into account when converting the ByteBuffer to a byte[].
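To see the mismatch directly, here is a minimal sketch that prints the buffer's bookkeeping next to the length of its backing array. The class name `BufferLimitDemo` and its `encode` helper are made up for illustration. The exact capacity is an implementation detail: OpenJDK's encoder appears to size the output buffer from an estimated average of 1.1 bytes per char, which would explain a capacity of 11 for ten chars, but that is not something to rely on.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
final class BufferLimitDemo {

  // Encodes the input and returns the buffer exactly as the encoder hands it back.
  static ByteBuffer encode(CharSequence input) {
    return StandardCharsets.UTF_8.encode(CharBuffer.wrap(input));
  }

  public static void main(String[] args) {
    ByteBuffer bytes = encode("AAAAAAAAAA"); // ten A's

    // The valid, encoded bytes live between position and limit.
    System.out.println("position  = " + bytes.position());
    System.out.println("limit     = " + bytes.limit());
    System.out.println("remaining = " + bytes.remaining());

    // The backing array may be longer than the valid region
    // (implementation-dependent, so no particular value is assumed here).
    System.out.println("capacity  = " + bytes.capacity());
    System.out.println("array len = " + bytes.array().length);
  }
}
```

Whenever the array length exceeds the limit, the extra slots are unused padding (the trailing 0 in the question's table), so only the bytes between position and limit should be copied.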

For example:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Main {

  public static void main(String[] args) throws Exception {
    for (int i = 1; i <= 10; i++) {
      String string = "A".repeat(i);

      CharBuffer chars = CharBuffer.wrap(string);
      ByteBuffer bytes = StandardCharsets.UTF_8.encode(chars);

      System.out.printf("%-10s | %s%n", string, Arrays.toString(toByteArray(bytes)));
    }
  }

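  // Copies only the bytes between position and limit, not the whole backing array.
  // Note: the absolute bulk get(int, byte[]) used here requires Java 13 or later.
  public static byte[] toByteArray(ByteBuffer buffer) {
    byte[] array = new byte[buffer.remaining()];
    buffer.get(buffer.position(), array);
    return array;
  }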
}

Which will output:

A          | [65]
AA         | [65, 65]
AAA        | [65, 65, 65]
AAAA       | [65, 65, 65, 65]
AAAAA      | [65, 65, 65, 65, 65]
AAAAAA     | [65, 65, 65, 65, 65, 65]
AAAAAAA    | [65, 65, 65, 65, 65, 65, 65]
AAAAAAAA   | [65, 65, 65, 65, 65, 65, 65, 65]
AAAAAAAAA  | [65, 65, 65, 65, 65, 65, 65, 65, 65]
AAAAAAAAAA | [65, 65, 65, 65, 65, 65, 65, 65, 65, 65]

Note the above example copies a region of the buffer's backing array, though the original ByteBuffer should be quickly garbage collected. The only way to avoid copying the backing array, that I can think of, is to adapt your code to work with the ByteBuffer directly (if you only return the backing array, you lose the position/limit information). Or I suppose you could create a wrapper class.
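Both options can be sketched roughly as follows. The names `Utf8Bytes`, `toUtf8`, and `writeUtf8` are hypothetical. `toUtf8` copies just the valid region into a right-sized array, while `writeUtf8` passes the valid region of the backing array straight to an `OutputStream`, so no intermediate array is created. The sketch assumes the encoder usually returns an array-backed buffer and falls back to a relative `get` if it does not; it only uses calls that exist since Java 7.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; names are for illustration only.
final class Utf8Bytes {

  // Returns a new array holding exactly the encoded bytes (position..limit).
  static byte[] toUtf8(CharSequence input) {
    ByteBuffer buffer = StandardCharsets.UTF_8.encode(CharBuffer.wrap(input));
    if (buffer.hasArray()) {
      int start = buffer.arrayOffset() + buffer.position();
      return Arrays.copyOfRange(buffer.array(), start, start + buffer.remaining());
    }
    // Fallback for buffers without an accessible backing array.
    byte[] result = new byte[buffer.remaining()];
    buffer.get(result);
    return result;
  }

  // Writes the encoded bytes to the stream without building a trimmed copy first.
  static void writeUtf8(CharSequence input, OutputStream out) throws IOException {
    ByteBuffer buffer = StandardCharsets.UTF_8.encode(CharBuffer.wrap(input));
    if (buffer.hasArray()) {
      out.write(buffer.array(), buffer.arrayOffset() + buffer.position(), buffer.remaining());
    } else {
      byte[] chunk = new byte[buffer.remaining()];
      buffer.get(chunk);
      out.write(chunk);
    }
  }

  public static void main(String[] args) throws IOException {
    // Ten A's: the result has ten elements and no trailing 0.
    System.out.println(Arrays.toString(toUtf8("AAAAAAAAAA")));

    // Works for any CharSequence, including non-ASCII content (U+00E9 is two bytes).
    System.out.println(Arrays.toString(toUtf8(new StringBuilder("\u00e9"))));

    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    writeUtf8("AAAAAAAAAA", sink);
    System.out.println(sink.size());
  }
}
```

Which variant fits depends on the caller: if a `byte[]` really is required, one copy is hard to avoid; if the bytes only need to go to a stream or channel, working with the buffer's valid region directly skips it.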
