11

I am trying to encode a String as UTF-16 and compute the MD5 hash of the resulting bytes. Strangely, Java and C# return different results for the same input.

The following is the piece of code in Java:

public static void main(String[] args) {
    String str = "preparar mantecado con coca cola";
    try {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        digest.update(str.getBytes("UTF-16"));
        byte[] hash = digest.digest();
        String output = "";
        for(byte b: hash){
            output += Integer.toString( ( b & 0xff ) + 0x100, 16).substring( 1 );
        }
        System.out.println(output);
    } catch (Exception e) {
        // Report failures instead of silently swallowing them
        e.printStackTrace();
    }
}

The output for this is: 249ece65145dca34ed310445758e5504

The following is the piece of code in C#:

public static string GetMD5Hash()
{
    string input = "preparar mantecado con coca cola";
    System.Security.Cryptography.MD5CryptoServiceProvider x = new System.Security.Cryptography.MD5CryptoServiceProvider();
    byte[] bs = System.Text.Encoding.Unicode.GetBytes(input);
    bs = x.ComputeHash(bs);
    System.Text.StringBuilder s = new System.Text.StringBuilder();
    foreach (byte b in bs)
    {
        // "x2" already produces two lowercase hex digits
        s.Append(b.ToString("x2"));
    }
    string output = s.ToString();
    Console.WriteLine(output);
    return output; // the method is declared to return a string
}

The output for this is: c04d0f518ba2555977fa1ed7f93ae2b3

I am not sure why the outputs differ. How can I change the code above so that both versions return the same output?

hippietrail
rkg
  • Compare your byte arrays first. If they differ in even a single bit, the hashes will be completely different. There may be a BOM in the UTF-16 encoding, and it may be little- or big-endian. – maaartinus Jan 25 '11 at 12:32

3 Answers

35

UTF-16 != UTF-16.

In Java, getBytes("UTF-16") returns a big-endian representation with a leading byte-order mark. C#'s System.Text.Encoding.Unicode.GetBytes returns a little-endian representation without one. I can't check your code from here, but I think you'll need to specify the byte order explicitly.

Try getBytes("UTF-16LE") in the Java version.
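A minimal sketch of that change, assuming Java 7 or later for StandardCharsets. The class and method names (Utf16LeMd5, md5Hex) are made up for illustration and are not from the question. It hashes the UTF-16LE bytes, which is the layout Encoding.Unicode.GetBytes produces, so the first line printed should match the C# value quoted in the question; the second line uses plain UTF-16 as in the original Java code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Utf16LeMd5 {
    // Hypothetical helper: MD5 of the string's bytes in the given charset, as lowercase hex.
    static String md5Hex(String text, Charset charset) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(text.getBytes(charset));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to 0..255 so negative bytes format as two hex digits
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this is unexpected
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String str = "preparar mantecado con coca cola";
        // Little-endian, no byte-order mark: the layout C# hashes
        System.out.println(md5Hex(str, StandardCharsets.UTF_16LE));
        // Big-endian with a leading byte-order mark: the layout the question's Java code hashes
        System.out.println(md5Hex(str, StandardCharsets.UTF_16));
    }
}
```

Passing a Charset constant instead of a charset name also avoids the checked UnsupportedEncodingException that getBytes(String) declares.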

Nordic Mainframe
  • It's worth noting that if you look at the output in eclipse, it still doesn't match what Visual Studio shows you. But strangely it does work... – debracey Dec 05 '11 at 16:05
  • 2015: tested Java 8.0 against .NET 4.0.x with Polish text, and it seems to work as you wrote. The bytes are identical in both languages and have *no* BOM prefix. Another important point to test: Java arithmetic silently accepts overflow (good for hashing), while C# by default does not. – Jacek Cz Oct 06 '15 at 11:19
5

The first thing I can find, and this might not be the only problem, is that C#'s Encoding.Unicode.GetBytes() is little-endian, while Java's natural byte order is big-endian.
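To see the byte-order difference directly, here is a small sketch that dumps the bytes Java produces for a one-character string under each UTF-16 variant. The class name Utf16ByteOrder and the hex helper are hypothetical, chosen only for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf16ByteOrder {
    // Hypothetical helper: space-separated lowercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a"; // U+0061
        // Plain UTF-16: big-endian, preceded by the byte-order mark FE FF
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // fe ff 00 61
        // Explicit big-endian: no byte-order mark
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 61
        // Explicit little-endian: no byte-order mark, bytes swapped
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 61 00
    }
}
```

Since MD5 is computed over the raw bytes, either the extra mark or the swapped byte order alone is enough to produce a completely different hash.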

Mark McKenna
0

You could use System.Text.Encoding.Unicode.GetString(byte[]) to convert the bytes back into a string. That way you can be sure that everything happens in the Unicode encoding.

Neonamu