UTF-16 Encoding in Java versus C#

Question

I am trying to encode a string as UTF-16 and compute its MD5 hash. Strangely, Java and C# return different results when I do this.

The following is the piece of code in Java:

public static void main(String[] args) {
    String str = "preparar mantecado con coca cola";
    try {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        digest.update(str.getBytes("UTF-16"));
        byte[] hash = digest.digest();
        String output = "";
        for(byte b: hash){
            output += Integer.toString( ( b & 0xff ) + 0x100, 16).substring( 1 );
        }
        System.out.println(output);
    } catch (Exception e) {
        // Report the failure instead of silently swallowing it.
        e.printStackTrace();
    }
}

The output for this is: 249ece65145dca34ed310445758e5504

The following is the piece of code in C#:

   public static string GetMD5Hash()
        {
            string input = "preparar mantecado con coca cola";
            System.Security.Cryptography.MD5CryptoServiceProvider x = new System.Security.Cryptography.MD5CryptoServiceProvider();
            byte[] bs = System.Text.Encoding.Unicode.GetBytes(input);
            bs = x.ComputeHash(bs);
            System.Text.StringBuilder s = new System.Text.StringBuilder();
            foreach (byte b in bs)
            {
                s.Append(b.ToString("x2").ToLower());
            }
            string output= s.ToString();
            Console.WriteLine(output);
            return output;
        }

The output for this is: c04d0f518ba2555977fa1ed7f93ae2b3

I am not sure why the outputs differ. How can I change the code above so that both versions return the same output?

Compare your byte arrays first. If they differ in even a single bit, the hashes will be completely different. There may be a BOM in the UTF-16 encoding, and it may be little- or big-endian. – maaartinus, Jan 25 '11 at 12:32
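The byte comparison suggested in this comment can be sketched in Java roughly as follows. This is only an illustration: the class name Utf16ByteDump, its helper methods, and the two-character sample string are made up for the example and are not part of the question's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting what each UTF-16 variant produces.
class Utf16ByteDump {
    // Render bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the text with the given charset and return the hex dump.
    static String dump(String text, Charset charset) {
        return toHex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String sample = "pr"; // first two characters of the question's string
        System.out.println("UTF-16   : " + dump(sample, StandardCharsets.UTF_16));   // fe ff 00 70 00 72
        System.out.println("UTF-16BE : " + dump(sample, StandardCharsets.UTF_16BE)); // 00 70 00 72
        System.out.println("UTF-16LE : " + dump(sample, StandardCharsets.UTF_16LE)); // 70 00 72 00
    }
}
```

Dumping the C# byte array the same way (for example with BitConverter.ToString) and comparing the two lists side by side shows where the inputs to MD5 diverge.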

score 35 · Accepted Answer · answered Jan 25 '11 at 12:31

UTF-16 != UTF-16.

In Java, getBytes("UTF-16") returns a big-endian representation prefixed with a byte-order mark (the bytes FE FF). C#'s System.Text.Encoding.Unicode.GetBytes returns a little-endian representation without a byte-order mark. I can't check your code from here, but I think you'll need to specify the conversion precisely.

Try getBytes("UTF-16LE") in the Java version.
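A minimal sketch of that suggestion, assuming the only difference between the two programs is the byte encoding. The class name Md5Utf16 and the helper md5Hex are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: MD5 over an explicitly chosen string encoding.
class Md5Utf16 {
    // Encode the text with the given charset, hash it, and return lowercase hex.
    static String md5Hex(String text, Charset charset) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(text.getBytes(charset));
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String str = "preparar mantecado con coca cola";
        // Big-endian with BOM: the encoding used in the question's Java code.
        System.out.println("UTF-16   : " + md5Hex(str, StandardCharsets.UTF_16));
        // Little-endian, no BOM: the layout Encoding.Unicode.GetBytes produces in C#.
        System.out.println("UTF-16LE : " + md5Hex(str, StandardCharsets.UTF_16LE));
    }
}
```

If the diagnosis is right, the UTF-16LE line should match the hash printed by the C# program in the question, while the UTF-16 line reproduces the original Java result. Using a StandardCharsets constant instead of a charset name string also avoids the checked UnsupportedEncodingException.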

Nordic Mainframe


It's worth noting that if you look at the output in Eclipse, it still doesn't match what Visual Studio shows you. But strangely it does work... – debracey Dec 05 '11 at 16:05
Tested in 2015 with Java 8 and .NET 4.0.x using Polish-language text: it seems to work as you wrote. The bytes in both languages are identical and have *no* BOM prefix. The next important thing to test: Java arithmetic silently accepts overflow (good for hashing), while C# by default does not. – Jacek Cz Oct 06 '15 at 11:19

score 5 · Answer 2 · answered Jan 25 '11 at 12:33

The first thing I can find, and this might not be the only problem, is that C#'s Encoding.Unicode.GetBytes() is little-endian, while Java's natural byte order is big-endian.

Mark McKenna


score 0 · Answer 3 · answered Jan 25 '11 at 12:36

You could use System.Text.Encoding.Unicode.GetString(byte[]) to convert the bytes back to a string. That way you can be sure that everything happens in Unicode encoding.

Neonamu

