Java: String to byte array conversion

Question

I am getting some unexpected results from what I thought was a simple test. After running the following:

byte[] bytes = {(byte)0x40, (byte)0xE2, (byte)0x56, (byte)0xFF, (byte)0xAD, (byte)0xDC};
String s = new String(bytes, Charset.forName("UTF-8"));
byte[] bytes2 = s.getBytes(Charset.forName("UTF-8"));

bytes2 is a 14-element array that looks nothing like the original (bytes). Is there a way to do this sort of conversion and retain the original decomposition to bytes?

As a general point, you say "bytes2 is [...] nothing like the original" - it'd still be useful to include it in the question. — Kristian Glass, Mar 30 '12 at 22:06

score 4 · Accepted Answer · answered Mar 30 '12 at 22:07

Is there a way to do this sort of conversion and retain the original decomposition to bytes?

Well that doesn't look like valid UTF-8 to me, so I'm not surprised it didn't round-trip.
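To see where the 14 bytes likely come from, here is a minimal sketch. It assumes Java 7+ (for StandardCharsets), and the class and method names are made up for illustration. The idea is that every byte the decoder cannot interpret becomes U+FFFD, the replacement character, and U+FFFD encodes back to three bytes (EF BF BD) in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are not from the question.
class ReplacementDemo {
    // The six bytes from the question.
    static final byte[] ORIGINAL = {
        (byte) 0x40, (byte) 0xE2, (byte) 0x56, (byte) 0xFF, (byte) 0xAD, (byte) 0xDC
    };

    // Decodes leniently, the way new String(bytes, charset) does:
    // malformed input is silently replaced instead of reported.
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Counts U+FFFD replacement characters in the decoded text.
    static int countReplacements(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = decodeLeniently(ORIGINAL);
        byte[] roundTripped = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("replacement characters: " + countReplacements(s));
        System.out.println("bytes after round trip: " + roundTripped.length);
    }
}
```

Only 0x40 ('@') and 0x56 ('V') are valid on their own here. The other four bytes each turn into one replacement character, so the round trip yields 2 + 4 * 3 = 14 bytes, which matches what the question reports.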

If you want to convert arbitrary binary data to text in a reversible way, use base64, e.g. via this public domain encoder/decoder.
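As a rough illustration of the base64 route, here is a sketch that uses java.util.Base64. Note the assumption: that class only exists from Java 8 onward and is not the encoder linked above, so on older JVMs a third-party encoder would be needed instead. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper; assumes Java 8+ for java.util.Base64.
class Base64RoundTrip {
    // Turns arbitrary bytes into an ASCII-only string.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recovers the exact original bytes from that string.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] bytes = {
            (byte) 0x40, (byte) 0xE2, (byte) 0x56, (byte) 0xFF, (byte) 0xAD, (byte) 0xDC
        };
        String s = toText(bytes);
        byte[] bytes2 = fromText(s);
        System.out.println(s);
        System.out.println("round trip ok: " + Arrays.equals(bytes, bytes2));
    }
}
```

Unlike decoding as UTF-8, this works for any byte sequence, because base64 never has to interpret the bytes as characters.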

Jon Skeet


Skeet that must be it. "This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. In order to detect such sequences, use the CharsetDecoder.decode(java.nio.ByteBuffer) method directly." ( http://docs.oracle.com/javase/6/docs/api/java/nio/charset/Charset.html#decode(java.nio.ByteBuffer) ) – Dilum Ranatunga Mar 30 '12 at 22:21
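For what it's worth, the strict route mentioned in that documentation might look something like the sketch below. It assumes Java 7+ for StandardCharsets; the class and method names are invented for the example. A decoder obtained from newDecoder() reports errors by default, and the actions are set explicitly here only to make that visible.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class StrictUtf8 {
    // Returns true only if the bytes decode as UTF-8 with no malformed input.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Throws instead of substituting replacement characters.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] fromQuestion = {
            (byte) 0x40, (byte) 0xE2, (byte) 0x56, (byte) 0xFF, (byte) 0xAD, (byte) 0xDC
        };
        System.out.println("question bytes valid: " + isValidUtf8(fromQuestion));
        System.out.println("ascii bytes valid: "
                + isValidUtf8("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

A check like this tells you up front whether a round trip through String can be lossless, instead of leaving you to find out from a mangled array afterwards.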

score 2 · Answer 2 · edited May 24 '12 at 12:17

This should do:

public class Main
{

    /*
     * This method converts a String to an array of bytes
     */
    public void convertStringToByteArray()
    {

        String stringToConvert = "This String is 76 characters long and will be converted to an array of bytes";

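        // Name the charset explicitly; getBytes() with no argument uses the platform default.
        byte[] theByteArray = stringToConvert.getBytes(java.nio.charset.Charset.forName("UTF-8"));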

        System.out.println(theByteArray.length);

    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args)
    {    
        new Main().convertStringToByteArray();
    }
}

score 1 · Answer 3 · answered Mar 30 '12 at 22:13

Two things:

The byte sequence does not appear to be valid UTF-8:

 $ python
 >>> '\x40\xe2\x56\xff\xad\xdc'.decode('utf8')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 1: invalid continuation byte

Even if it were valid UTF-8, decoding and then re-encoding is not guaranteed to give back the same bytes, due to things like precomposed characters (which Unicode normalization can rewrite) and other Unicode features.
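To make the precomposed-character point concrete in Java terms, here is a small sketch (invented names, Java 7+ assumed). As far as I know, Java's String constructor and getBytes do not normalize by themselves, so the difference only shows up once a normalization step sits somewhere in the path. java.text.Normalizer stands in for such a step here.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

// Hypothetical demo class; the normalization step is an assumed part of the pipeline.
class NormalizationDemo {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT, as UTF-8: 65 CC 81.
    static final byte[] DECOMPOSED = {(byte) 0x65, (byte) 0xCC, (byte) 0x81};

    // Decode and re-encode with no other processing.
    static byte[] plainRoundTrip(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Decode, normalize to NFC (which prefers precomposed characters), re-encode.
    static byte[] normalizedRoundTrip(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("plain:      " + Arrays.toString(plainRoundTrip(DECOMPOSED)));
        System.out.println("normalized: " + Arrays.toString(normalizedRoundTrip(DECOMPOSED)));
    }
}
```

The plain round trip returns the three original bytes, while the normalized one returns the two-byte encoding of the precomposed U+00E9 (C3 A9). Both decode to text that looks identical on screen.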

If you want to encode arbitrary binary data in a string in a way where you are guaranteed to get the same bytes back when you decode them, your best bet is something like base64.
