I want to get a 64-bit hash code for a given string. What is the fastest way to do that? There is a built-in method for getting a 32-bit hash code, but I need 64 bits.

I am only looking for an integer hash, not MD5.

Thank you very much.

C# 4.0

Kirill Polishchuk
Furkan Gözükara
  • Why? Why do you need 64bit hashcode? – Oded Jan 11 '12 at 13:53
  • I am going to store crawled URLs in the database. To minimize collisions and maximize speed, I need a 64-bit hash code. – Furkan Gözükara Jan 11 '12 at 13:53
  • You think that 32 bits will cause so many collisions? How many URLs are you planning on storing? – Oded Jan 11 '12 at 13:54
  • If *fast* is the only requirement, you can simply assign the 32 bit hash value to a 64 bit variable. – Codo Jan 11 '12 at 13:54
  • It is not the only requirement. The main aim is to reduce possible collisions. There can be up to 10 million URLs. – Furkan Gözükara Jan 11 '12 at 13:59
  • The address space for 32bits is much larger than 10 million. It feels like you are doing some premature optimization. – Oded Jan 11 '12 at 14:01
  • Yes, but if you do the math, 32 bits carries a very big risk of collision when there are 10 million strings :) 64 bits is the best solution for me. – Furkan Gözükara Jan 11 '12 at 14:02
  • So you have a collision and you have to look at a few more rows to find the match, is that really such an issue with such a small number of strings as 10million? – Jon Hanna Jan 11 '12 at 14:44
  • The birthday paradox gives a risk of about one in 368936 for a collision among 10 million rows with a 64-bit hash. That is if the hash has a perfect distribution. `1 - e ^ ( -10^7 * (10^7 - 1) / ( 2 * 2^64 ) )` – Jonas Elfström Jan 11 '12 at 15:08
  • If only databases were good at hashing.... – Mark Peters Jan 11 '12 at 15:46
  • @JonasElfström exactly, a tiny number of collisions. It's not like they still aren't going to have to be ready to handle collisions with the 64bit hash. – Jon Hanna Jan 12 '12 at 08:45
  • Right now the system is working. So far there are 800k rows, and there are already many collisions with 32 bits. But the maximum collision count is 5. – Furkan Gözükara Jan 12 '12 at 09:41
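
The birthday estimate quoted in the comments can be checked with a few lines of C#. This is only a sketch: the class and method names are hypothetical, and it assumes a perfectly uniform hash, which `GetHashCode` does not promise.

```csharp
using System;

// Hypothetical helper for checking the birthday-bound figures above.
public static class BirthdayBound
{
    // Approximate probability of at least one collision among n uniformly
    // distributed hashes of the given width: 1 - e^(-n(n-1) / (2 * 2^bits)).
    public static double CollisionProbability(long n, int bits)
    {
        double exponent = -((double)n * (n - 1)) / (2.0 * Math.Pow(2.0, bits));
        return 1.0 - Math.Exp(exponent);
    }

    // Expected number of colliding pairs: n(n-1) / (2 * 2^bits).
    public static double ExpectedCollidingPairs(long n, int bits)
    {
        return ((double)n * (n - 1)) / (2.0 * Math.Pow(2.0, bits));
    }
}
```

Under that assumption, `CollisionProbability(10000000, 64)` comes out at roughly one in 369,000, in line with the figure in the comment, while `CollisionProbability(10000000, 32)` is practically 1. `ExpectedCollidingPairs(800000, 32)` is about 75, so seeing collisions at 800k rows with a 32-bit hash is expected.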

6 Answers


Simple solution:

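public static long GetHashCodeInt64(string input)
{
    var s1 = input.Substring(0, input.Length / 2);
    var s2 = input.Substring(input.Length / 2);

    // Cast the low half to uint so that a negative hash code is not
    // sign-extended over the high 32 bits when it is widened to long.
    var x = ((long)s1.GetHashCode()) << 0x20 | unchecked((uint)s2.GetHashCode());

    return x;
}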
Kirill Polishchuk
  • Kirill, would this one or Pratik's work faster? And would they produce the same result? – Furkan Gözükara Jan 11 '12 at 14:19
  • @MonsterMMORPG, this will be faster, they will produce different hashes. – Kirill Polishchuk Jan 11 '12 at 14:22
  • @MonsterMMORPG, Also, if you are storing these hashes, prefer MD5 or any other hash implementation (e.g. @Pratik's solution), because a future version of `string` might use a different algorithm for calculating the object's hash code. – Kirill Polishchuk Jan 11 '12 at 14:26
  • I need to generate a bigint, but MD5 output also contains letters, so I will stick with @Pratik's solution. – Furkan Gözükara Jan 11 '12 at 14:30
  • @KirillPolishchuk, there's a bug with this piece of code on some machines (I can't pinpoint the spec that causes it) if the hashcode of the first half is negative. Consider casting both hashcodes to UInt64 before the SHIFT and OR operations. – giladrv Mar 29 '15 at 14:03
  • The values from GetHashCode should never be stored to permanent storage like a database. There's no guarantee you'll get consistent values the next time you run your application (especially if you made updates). – Mike Fisher Sep 26 '19 at 18:53
  • This method sometimes throws this exception: `value was either too large or too small for an int64.` – Tim.Tang Nov 12 '19 at 23:50
  • Note that GetHashCode() does NOT GUARANTEE to give the same value between two invocations. This means that if you store the hash in a DB and try to look it up later, you will get some funny bugs. In the comments, OP says that this is the use case. – Göran Roseen Aug 18 '20 at 13:47
  • @KirillPolishchuk The OP stated that they would store the hash in a database, and GetHashCode() is not suitable for that. It may return different values for the same data on different invocations. I recently had a bug caused by this... – Göran Roseen Aug 21 '20 at 14:02
  • @GöranRoseen, the question says nothing about DB – Kirill Polishchuk Aug 23 '20 at 07:12

Since the question is about hashing URLs for storage, I presume you always need the same 64-bit hash for the same input. GetHashCode is not reliable in this way. To make a hash with few collisions, I use this one.

// Requires: using System; using System.Linq; using System.Security.Cryptography; using System.Text;
public static ulong GetUInt64Hash(HashAlgorithm hasher, string text)
{
    using (hasher)
    {
        // UTF-8 gives the same bytes on every machine; Encoding.Default depends on the platform.
        var bytes = hasher.ComputeHash(Encoding.UTF8.GetBytes(text));
        Array.Resize(ref bytes, (bytes.Length + 7) / 8 * 8); // zero-pad to a multiple of 8 if the hash is not; for example, SHA1 creates 20 bytes
        return Enumerable.Range(0, bytes.Length / 8) // one index per 8-byte chunk in the byte array
            .Select(i => BitConverter.ToUInt64(bytes, i * 8)) // read 8 bytes at a time as an integer
            .Aggregate((x, y) => x ^ y); // XOR the chunks together so you end up with a ulong (64-bit int)
    }
}

To use it, just pass whichever hash algorithm you prefer:

ulong result = GetUInt64Hash(SHA256.Create(), "foodiloodiloo");
//result: 259973318283508806

or

ulong result = GetUInt64Hash(SHA1.Create(), "foodiloodiloo");
//result: 6574081600879152103

The difference between this one and the accepted answer is that this one XORs all the bits, and you can use whatever algorithm you want.

  • I think this answer is seriously underrated. GetHashCode() is, as you point out, NOT GUARANTEED to give the same value between invocations. That means that if you store the hash and try to match it later, you will have funny bugs. Also, you use all the bytes in the bytes array (the current top answer has a bug there) – Göran Roseen Aug 18 '20 at 13:43
  • Seems like this approach (mainly `bytes.Length / 8`) will not work with some algorithms that produce a hash whose length is not divisible by 8 (e.g. SHA1 produces a 20-byte hash). – Nick Nov 23 '20 at 03:31
  • @Nick you are right. Had a bug in the code before that only used the first 16 bytes if you used SHA1. Updated it now to resize the array to a multiple of 8. Thank you! – Daniel Richter Nov 24 '20 at 14:55
  • @Daniel Nice. Do you think XORing the chunks of bytes from a SHA hash will avoid collisions like a full SHA hash? Isn't a XOR of [1, 5] and [5, 1] the same despite them being different sequences. I'm just trying to understand if XOR is a safe option for hash bytes over 8-bytes. – Nick Nov 24 '20 at 22:25
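
One way to sidestep the XOR question raised in the last comment is to skip folding and keep only the first 8 bytes of the digest. The sketch below is not the answer author's method; the class and method names are hypothetical, and it fixes two things by assumption: UTF-8 input encoding and a big-endian read of the bytes, so the value does not depend on the machine.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of the answer above.
public static class TruncatedSha256
{
    // Returns the first 8 bytes of the SHA-256 digest of the UTF-8 encoding
    // of text, read as a big-endian unsigned 64-bit integer.
    public static ulong Hash64(string text)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
            ulong value = 0;
            for (int i = 0; i < 8; i++)
            {
                // Shift in one digest byte at a time, most significant first.
                value = (value << 8) | digest[i];
            }
            return value;
        }
    }
}
```

For the standard SHA-256 test input "abc", whose digest starts with `ba7816bf8f01cfea`, `Hash64("abc")` returns `0xBA7816BF8F01CFEA`. To store the result in a signed 64-bit database column, `unchecked((long)value)` keeps the same bit pattern.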

I'll introduce a new possible answer. xxHash is very fast. Check out the benchmarks here:

https://cyan4973.github.io/xxHash/

It has a NuGet package: https://www.nuget.org/packages/System.Data.HashFunction.xxHash

Or the open source code: https://github.com/brandondahler/Data.HashFunction/blob/master/src/System.Data.HashFunction.xxHash/xxHash_Implementation.cs

The other answers here are either 1. questionable as to their real prevention of collision or 2. just wrappers around the large and slow existing HashAlgorithm implementations.

xxHash is not cryptographic strength, but it would seem to fit the bill better for what you need. It:

  1. Is 64 bits all the way.
  2. Is benchmarked faster than the others.
  3. Has a good distribution, which keeps collisions to a minimum.
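
To show what a non-cryptographic 64-bit string hash looks like without depending on the NuGet package, here is a sketch of FNV-1a. Note that this is a different and much simpler algorithm than xxHash, swapped in only because it fits in a few lines; it is not claimed to match xxHash's speed or distribution. The class name is hypothetical and UTF-8 input encoding is an assumption.

```csharp
using System;
using System.Text;

// Hypothetical illustration: 64-bit FNV-1a, NOT xxHash.
public static class Fnv1a64
{
    private const ulong OffsetBasis = 14695981039346656037UL; // 0xCBF29CE484222325
    private const ulong Prime = 1099511628211UL;              // 0x100000001B3

    public static ulong Hash(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        ulong hash = OffsetBasis;
        unchecked
        {
            foreach (byte b in bytes)
            {
                hash ^= b;     // mix in the next byte
                hash *= Prime; // multiply, wrapping modulo 2^64
            }
        }
        return hash;
    }
}
```

Unlike `GetHashCode`, the result depends only on the input bytes, so it stays the same across runs and machines. For example, `Fnv1a64.Hash("a")` is `0xAF63DC4C8601EC8C`.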
Menace

This code is from Code Project Article - Convert String to 64bit Integer

static Int64 GetInt64HashCode(string strText)
{
    Int64 hashCode = 0;
    if (!string.IsNullOrEmpty(strText))
    {
        // Unicode (UTF-16) encoding covers the whole character set
        byte[] byteContents = Encoding.Unicode.GetBytes(strText);
        System.Security.Cryptography.SHA256 hash =
            new System.Security.Cryptography.SHA256CryptoServiceProvider();
        byte[] hashText = hash.ComputeHash(byteContents);
        // Read 8-byte chunks of the 32-byte digest and fold them with XOR:
        // hashCodeStart  = bytes 0~7
        // hashCodeMedium = bytes 8~15
        // hashCodeEnd    = bytes 24~31
        // Note: bytes 16~23 of the digest are not used by this code.
        Int64 hashCodeStart = BitConverter.ToInt64(hashText, 0);
        Int64 hashCodeMedium = BitConverter.ToInt64(hashText, 8);
        Int64 hashCodeEnd = BitConverter.ToInt64(hashText, 24);
        hashCode = hashCodeStart ^ hashCodeMedium ^ hashCodeEnd;
    }
    return (hashCode);
}
Andrew Barber
Pratik

I have used @Kirill's solution. I'm a little bit weird and I don't like "var" (I guess it's because I come from C++), so I made a variant:

string s1 = text.Substring(0, text.Length / 2);
string s2 = text.Substring(text.Length / 2);

Byte[] MS4B = BitConverter.GetBytes(s1.GetHashCode());
Byte[] LS4B = BitConverter.GetBytes(s2.GetHashCode());
UInt64 hash = (UInt64)MS4B[0] << 56 | (UInt64)MS4B[1] << 48 | 
              (UInt64)MS4B[2] << 40 | (UInt64)MS4B[3] << 32 |
              (UInt64)LS4B[0] << 24 | (UInt64)LS4B[1] << 16 | 
              (UInt64)LS4B[2] << 8  | (UInt64)LS4B[3] ;

I'm not very sure about the order of the bytes; it depends on the machine (whether it is little-endian or big-endian). But who cares? It's just a number (a hash). Thank you @Kirill, it was very useful to me!

joce
chasques
  • If you want efficiency as I think you do, maybe you should avoid creating the two byte arrays and shift the integers themselves? – Djof Aug 30 '13 at 15:13
  • @chasques, if you don't like `var`, then you probably also don't like C++ `auto` .... – Sebastian Oct 08 '13 at 14:17
  • The values from GetHashCode should never be stored to permanent storage like a database. There's no guarantee you'll get consistent values the next time you run your application (especially if you made updates). – Mike Fisher Sep 26 '19 at 18:52
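
The suggestion in the first comment, shifting the integers themselves instead of building byte arrays, could look like the sketch below. The class and method names are hypothetical. Both halves go through `uint` first so that a negative value is not sign-extended, and the casts are wrapped in `unchecked` so that they do not throw when the project is compiled with overflow checking.

```csharp
// Hypothetical helper illustrating the comment above.
public static class HashCombine
{
    // Packs two 32-bit values into one 64-bit value:
    // 'high' fills bits 63..32 and 'low' fills bits 31..0.
    public static ulong Combine(int high, int low)
    {
        unchecked
        {
            return ((ulong)(uint)high << 32) | (uint)low;
        }
    }
}
```

For example, `HashCombine.Combine(1, 2)` is `0x0000000100000002`, and `HashCombine.Combine(0, -1)` is `0x00000000FFFFFFFF` with the upper half left untouched. The caveat from the comments still applies: if the two inputs come from `GetHashCode`, the combined value is not stable enough to store.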

I assume you are referring to the MD5 hashing algorithm for your current use?

You can use SHA-256 for twice the length.

http://msdn.microsoft.com/en-us/library/system.security.cryptography.sha256.aspx

Extract...

byte[] data = new byte[DATA_SIZE];
byte[] result;
SHA256 shaM = new SHA256Managed();
result = shaM.ComputeHash(data);
musefan