1

I am getting some unexpected results from what I thought was a simple test. After running the following:

byte[] bytes = {(byte)0x40, (byte)0xE2, (byte)0x56, (byte)0xFF, (byte)0xAD, (byte)0xDC};
String s = new String(bytes, Charset.forName("UTF-8"));
byte[] bytes2 = s.getBytes(Charset.forName("UTF-8"));

bytes2 is a 14-element array that looks nothing like the original (bytes). Is there a way to do this sort of conversion and retain the original decomposition to bytes?

Plastech
    As a general point, you say "bytes2 is [...] nothing like the original" - it'd still be useful to include it in the question. – Kristian Glass Mar 30 '12 at 22:06

3 Answers

4

Is there a way to do this sort of conversion and retain the original decomposition to bytes?

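Well, that doesn't look like valid UTF-8 to me (0xE2 starts a three-byte sequence but is followed by 0x56, which is not a continuation byte, and 0xFF never appears in UTF-8 at all), so I'm not surprised it didn't round-trip.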
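A minimal sketch of what seems to be happening, using the bytes from the question. The class and method names (`ReplacementDemo`, `roundTrip`) are made up for illustration. Each malformed sequence is decoded to the replacement character U+FFFD, which encodes back to the three bytes EF BF BD, and that would account for the 14-byte result reported in the question.

```java
import java.nio.charset.Charset;

public class ReplacementDemo {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Decode as UTF-8 (malformed input is silently replaced), then encode again.
    static byte[] roundTrip(byte[] input) {
        String s = new String(input, UTF8);
        return s.getBytes(UTF8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x40, (byte) 0xE2, (byte) 0x56,
                        (byte) 0xFF, (byte) 0xAD, (byte) 0xDC};

        // Show the code points the decoder actually produced.
        String s = new String(bytes, UTF8);
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();

        // Show the re-encoded bytes; the length matches the 14 from the question.
        byte[] bytes2 = roundTrip(bytes);
        for (byte b : bytes2) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        System.out.println(bytes2.length);
    }
}
```

The two valid bytes (0x40 and 0x56) survive; everything else is lost at the decoding step, so no later call can recover it.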

If you want to convert arbitrary binary data to text in a reversible way, use base64, e.g. via this public domain encoder/decoder.
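As a sketch of the base64 approach: the example below assumes Java 8 or later, where `java.util.Base64` is part of the standard library (on older versions you would need a third-party encoder such as the one linked). `Base64RoundTrip`, `toText`, and `fromText` are illustrative names only.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    // Turn arbitrary bytes into an ASCII-only String.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recover the original bytes from the base64 text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x40, (byte) 0xE2, (byte) 0x56,
                        (byte) 0xFF, (byte) 0xAD, (byte) 0xDC};

        String s = toText(bytes);
        byte[] bytes2 = fromText(s);

        System.out.println(s);
        System.out.println(Arrays.equals(bytes, bytes2)); // true: the round trip is lossless
    }
}
```

The resulting string is not human-readable text, but it contains only ASCII characters and can be stored or transmitted anywhere a String is expected.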

Jon Skeet
  • Skeet that must be it. "This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. In order to detect such sequences, use the CharsetDecoder.decode(java.nio.ByteBuffer) method directly." ( http://docs.oracle.com/javase/6/docs/api/java/nio/charset/Charset.html#decode(java.nio.ByteBuffer) ) – Dilum Ranatunga Mar 30 '12 at 22:21
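
Following up on the CharsetDecoder suggestion in the comment above, here is a rough sketch of detecting malformed input instead of having it silently replaced. The helper name `isValidUtf8` is hypothetical; the error actions are set to REPORT explicitly so the intent is visible in the code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of substituting a replacement character.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0x40, (byte) 0xE2, (byte) 0x56,
                      (byte) 0xFF, (byte) 0xAD, (byte) 0xDC};
        byte[] good = {(byte) 0x40, (byte) 0x56};

        System.out.println(isValidUtf8(bad));  // false
        System.out.println(isValidUtf8(good)); // true
    }
}
```
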
2

This should do:

public class Main
{

    /*
     * This method converts a String to an array of bytes
     */
    public void convertStringToByteArray()
    {

        String stringToConvert = "This String is 76 characters long and will be converted to an array of bytes";

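        // Name the charset explicitly; the no-argument getBytes() uses the platform default.
        byte[] theByteArray = stringToConvert.getBytes(java.nio.charset.Charset.forName("UTF-8"));

        // Prints 76: every character here is ASCII, so each one encodes to a single byte.
        System.out.println(theByteArray.length);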

    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args)
    {    
        new Main().convertStringToByteArray();
    }
}
ServAce85
Shan Valleru
1

Two things:

  1. The byte sequence does not appear to be valid UTF-8

     $ python
     >>> '\x40\xe2\x56\xff\xad\xdc'.decode('utf8')
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
       File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
         return codecs.utf_8_decode(input, errors, True)
     UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
    
  2. Even if it were valid UTF-8, decoding and then encoding is not guaranteed to give back the same bytes once anything normalizes the text in between, because precomposed and decomposed forms of the same character encode to different byte sequences.
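
To illustrate the second point in Java terms (a sketch; `NormalizationDemo` and `utf8Length` are illustrative names): the precomposed and decomposed forms of the same character have different UTF-8 encodings, so the bytes change if the text is normalized anywhere between decoding and encoding. As far as I know, `new String(...)` and `getBytes(...)` do not normalize on their own, so this only matters when something else in the pipeline does.

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

public class NormalizationDemo {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(UTF8).length;
    }

    public static void main(String[] args) {
        // U+00E9: "e with acute" as a single precomposed code point.
        String precomposed = "\u00E9";
        // NFD splits it into 'e' followed by U+0301 (combining acute accent).
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        System.out.println(utf8Length(precomposed)); // 2 (C3 A9)
        System.out.println(utf8Length(decomposed));  // 3 (65 CC 81)
    }
}
```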

If you want to encode arbitrary binary data in a string in a way where you are guaranteed to get the same bytes back when you decode them, your best bet is something like base64.

Geoff Reedy