
I have a requirement to hash input strings and produce 14 digit decimal numbers as output.

The math I am using tells me I can have, at maximum, a 46 bit unsigned integer: 2^46 - 1 = 70,368,744,177,663 still fits in 14 digits, whereas a 47 bit value can reach 15 digits.

I am aware that a 46 bit uint means less collision resistance for any potential hash function. However, the number of hashes I am creating keeps the collision probability in an acceptable range.
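As a rough sanity check on that claim, the standard birthday-bound approximation can be sketched as below. The class name, method name, and the example count of 100,000 hashes are illustrative assumptions, not figures from the actual workload, and the formula treats the truncated hash as uniformly random.

```csharp
using System;

public static class CollisionEstimate
{
    // Birthday-bound approximation for n items drawn uniformly from 2^bits values:
    //   p ≈ 1 - exp(-n(n-1) / (2 * 2^bits))
    // hashCount and bits are hypothetical parameter names used for illustration.
    public static double Probability(double hashCount, int bits)
    {
        double space = Math.Pow(2, bits);
        return 1.0 - Math.Exp(-hashCount * (hashCount - 1) / (2.0 * space));
    }
}
```

For example, `CollisionEstimate.Probability(100000, 46)` comes out at roughly 7e-5, i.e. well under a 0.01% chance of at least one collision among 100,000 hashes.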

I would be most grateful if the community could help me verify that my method for truncating a hash to 46 bits is solid. I have a gut feeling that there are optimizations and/or easier ways to do this. My function is as follows (where bitLength is 46 when this function is called):

    public static UInt64 GetTruncatedMd5Hash(string input, int bitLength)
    {
        var md5Hash = MD5.Create();

        byte[] fullHashBytes = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(input));

        var fullHashBits = new BitArray(fullHashBytes);

        // BitArray stores LSB of each byte in lowest indexes, so reversing...
        ReverseBitArray(fullHashBits);

        // truncate by copying only number of bits specified by bitLength param
        var truncatedHashBits = new BitArray(bitLength);
        for (int i = 0; i < bitLength - 1; i++)
        {
            truncatedHashBits[i] = fullHashBits[i];
        }

        byte[] truncatedHashBytes = new byte[8];

        truncatedHashBits.CopyTo(truncatedHashBytes, 0);

        return BitConverter.ToUInt64(truncatedHashBytes, 0);
    }

Thanks for taking a look at this question. I appreciate any feedback!

Rob Davis
  • Do you have a rule that you must use the high bits of a truncated byte? And which endianness do you want to use? Little? Big? Or just don't care? Does it need to be consistent across CPU architectures? – CodesInChaos Jan 29 '14 at 21:58
  • What about simply using `BitConverter.ToUInt64(fullHashBytes, 0) % (1000000ul * 1000000ul * 100ul)`? That uses native endianness, so if you need consistency across CPU architectures you need to use a [fixed endianness `byte[]` to integer converter](http://stackoverflow.com/questions/18648103/is-there-a-better-way-to-detect-endianness-in-net-than-bitconverter-islittleend/18803915#18803915). – CodesInChaos Jan 29 '14 at 22:02
  • Or `BitConverter.ToUInt64(fullHashBytes.Reverse().ToArray(), 0) & 0x3fffffffffff;` – L.B Jan 29 '14 at 22:03
  • @CodesInChaos, thanks for your responses. I am after little-endian in this case. I am not sure I understand the consequences of using high vs. low bits here. Would you be able to shed some light on that? – Rob Davis Jan 29 '14 at 22:27
  • @L.B - thanks for the suggestion. Your bit masking approach seems very clean. – Rob Davis Jan 30 '14 at 17:46

1 Answer


With the help of the comments above, I crafted the following solution:

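    public static UInt64 GetTruncatedMd5Hash(string input, int bitLength)
    {
        if (string.IsNullOrWhiteSpace(input))
            throw new ArgumentException("input must not be null or whitespace");

        if (bitLength < 1 || bitLength > 64)
            throw new ArgumentOutOfRangeException("bitLength", "bitLength must be between 1 and 64");

        byte[] fullHashBytes;
        using (var md5Hash = MD5.Create())
        {
            fullHashBytes = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(input));
        }

        // BitConverter reads the first 8 bytes in the platform's native byte order.
        ulong value = BitConverter.ToUInt64(fullHashBytes, 0);

        // 1UL << 64 wraps around to 1UL << 0, so the full-width case is handled separately.
        if (bitLength == 64)
            return value;

        // Keep only the low bitLength bits.
        ulong bitMask = (1UL << bitLength) - 1UL;

        return value & bitMask;
    }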

It's much tighter (and faster) than what I was trying to do before.
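One caveat raised in the comments: `BitConverter.ToUInt64` reads bytes in the platform's native order. If the result must be identical on every architecture, the conversion can be done by hand. The following is only a sketch under that assumption; `HashTruncation`, `ReadUInt64LittleEndian`, and `Truncate` are hypothetical names, and it takes the hash bytes directly so that the byte order handling is the only thing being shown.

```csharp
using System;

public static class HashTruncation
{
    // Interprets the first 8 bytes as a little-endian unsigned integer,
    // regardless of the CPU's native byte order.
    public static ulong ReadUInt64LittleEndian(byte[] bytes)
    {
        if (bytes == null || bytes.Length < 8)
            throw new ArgumentException("at least 8 bytes are required");

        ulong value = 0;
        for (int i = 0; i < 8; i++)
        {
            value |= (ulong)bytes[i] << (8 * i);
        }
        return value;
    }

    // Keeps the low bitLength bits (1..64) of the little-endian value.
    public static ulong Truncate(byte[] hashBytes, int bitLength)
    {
        if (bitLength < 1 || bitLength > 64)
            throw new ArgumentOutOfRangeException("bitLength");

        ulong value = ReadUInt64LittleEndian(hashBytes);
        if (bitLength == 64)
            return value;

        return value & ((1UL << bitLength) - 1UL);
    }
}
```

Passing the 16 byte array returned by `ComputeHash` together with a `bitLength` of 46 should give the same value as the function above on little-endian machines, while staying stable on big-endian ones.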

Rob Davis