3

I need a 64-bit hash for strings, and the default .GetHashCode() returns only a 32-bit int. I could generate an MD5/SHA1 hash and use only the first 64 bits, but because those algorithms are cryptographically secure, they are much more demanding on the CPU.

Could it be as simple as calling .GetHashCode() a second time, on the reverse of the input string, and casting the two 32-bit ints into a 64-bit long? Would that have the same spread and collision resistance as a 'real' 64-bit hash like CRC64?
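Something like this is what I have in mind (an untested sketch; the HashSketch/Hash64 names are just for illustration):

```csharp
using System;

static class HashSketch
{
    // Sketch: hash the string and its reverse, then pack the two
    // 32-bit results into a single 64-bit value.
    public static long Hash64(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        string reversed = new string(chars);

        ulong high = (uint)s.GetHashCode();
        ulong low = (uint)reversed.GetHashCode();
        return (long)((high << 32) | low);
    }
}
```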

Maestro
  • See this answer http://stackoverflow.com/questions/7954602/creating-a-hashcode-for-use-in-a-database-ie-not-using-gethashcode/7960466#7960466 for a 64-bit hash code implementation. – Jim Mischel Nov 09 '11 at 00:11
  • Why would this be "not constructive?" There are good reasons for wanting a hash code that you can guarantee doesn't change. And good reasons for wanting a 64-bit hash code. – Jim Mischel Nov 09 '11 at 00:13
  • Why do you think you need a 64-bit hash? – Dour High Arch Nov 09 '11 at 00:22
  • @DourHighArch Because 32-bit will give me too many collisions, and 128-bit is overkill. – Maestro Nov 09 '11 at 00:27
  • I would definitely *not* recommend using a hash code for a record key. As Chris Haas pointed out in his answer, a hash code doesn't create a unique value. 64 bits means that the *likelihood* of getting a collision will be small, but there will be collisions. As I said in my linked answer above, "let the database do what it's good at." Using a hash code to "optimize" your database access is almost certainly a bad decision. – Jim Mischel Nov 09 '11 at 00:43
  • @JimMischel Normally I let SQLite do what it's good at, but it has only b-tree indexes, no hash-based ones. If you look at this article http://apollo.backplane.com/matt/crc64.html you will see that collisions are not likely for my dataset (5 million strings). – Maestro Nov 09 '11 at 12:56
  • "Not likely" is not the same as "impossible," as I found out a couple of years ago: http://blog.mischel.com/2009/07/18/highly-unlikely-is-not-the-same-as-impossible/. In your case, the probability of a collision is about one in a million. That said, if you're willing to lose a record (or perhaps corrupt a record) because of a duplicate, then knock yourself out. – Jim Mischel Nov 09 '11 at 16:22

3 Answers

3

You are about to make a very big mistake. A 64-bit hash isn't nearly good enough to guarantee uniqueness. That requires at least 128 bits. A GUID is a common choice.

Generating unique 32-bit or 64-bit numbers isn't that hard; you simply use the next one. The rub is that you need to know the previous one. Database engines never have a problem with that, their whole reason for being is remembering stuff.

Use an auto-increment column.
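A minimal sketch of that, assuming the System.Data.SQLite ADO.NET provider (the table and file names are made up):

```csharp
using System;
using System.Data.SQLite;   // assumption: the System.Data.SQLite ADO.NET provider

class AutoIncrementDemo
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=strings.db"))
        {
            conn.Open();

            // In SQLite, INTEGER PRIMARY KEY aliases the 64-bit rowid,
            // so the engine assigns the next id for you.
            new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, value TEXT)",
                conn).ExecuteNonQuery();

            new SQLiteCommand(
                "INSERT INTO items (value) VALUES ('hello')", conn).ExecuteNonQuery();

            // The id the engine just handed out.
            using (var cmd = new SQLiteCommand("SELECT last_insert_rowid()", conn))
                Console.WriteLine(cmd.ExecuteScalar());
        }
    }
}
```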

Hans Passant
  • The problem is that I need to look up strings quickly. If I create an indexed TEXT column in SQLite, all strings are stored twice (because of the b-tree index), and additional inserts become very slow after a million rows (because of b-tree page splitting, etc). I agree with you that it's best to keep the PRIMARY KEY a normal integer, but what's wrong with adding a second column containing the hash, and indexing that instead of the TEXT column? – Maestro Nov 09 '11 at 13:02
  • I can't see why you think that improves anything. Database engines already have good ways to generate a hash from a string. You really need to distinguish the task of looking up records (easy and fast) from inserting rows into a table with a million records. Which tends to highlight the LITE in SQLite. – Hans Passant Nov 09 '11 at 13:14
  • Most database engines may have ways to generate a hash from a string, but SQLite has none (afaik)? So it's always creating an index on large (text) values, when it could just index a small (integer) value. – Maestro Nov 09 '11 at 13:35
  • This is getting a bit random. Measure to convince yourself. – Hans Passant Nov 09 '11 at 13:44
  • I already did many benchmarks. The first million rows are inserted in 10 secs, but inserting the next 10,000 rows takes 30 secs, which is way too slow. This is caused by the design of SQLite, which needs to read a large part of the existing index from disk before it can add new rows to the index. By keeping the index small (using hash values) I want to reduce those 30 seconds. See http://stackoverflow.com/questions/8065949/net-key-value-database – Maestro Nov 09 '11 at 13:51
  • No, that only told you that it is slow. You already knew that. You have to *compare*. Measure whether it is faster when you add an extra column and write more data. – Hans Passant Nov 09 '11 at 13:56
  • Although I add an extra column, I actually write _less_ data, because I can remove the INDEX on the TEXT column, which is responsible for 50% of the database size. But you're right, I should prepare a benchmark to confirm it's faster. – Maestro Nov 09 '11 at 14:07
2

Just to get this out of the way: you know that GetHashCode() doesn't generate anything unique, right? Two completely different strings can return the same hash code. The algorithm is only intended to create an even distribution of objects in a hash table.
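If you want to see that for yourself, a quick brute-force search (my own sketch, not from any library) will typically turn up two colliding strings after well under a million random tries, thanks to the birthday effect:

```csharp
using System;
using System.Collections.Generic;

class CollisionDemo
{
    static void Main()
    {
        var seen = new Dictionary<int, string>();
        var rng = new Random(12345);
        long attempts = 0;

        while (true)
        {
            attempts++;
            string s = RandomString(rng, 12);
            int h = s.GetHashCode();

            // A collision is two *different* strings with the same hash code.
            if (seen.TryGetValue(h, out string other) && other != s)
            {
                Console.WriteLine($"Collision after {attempts} strings:");
                Console.WriteLine($"  \"{other}\" and \"{s}\" both hash to {h}");
                return;
            }
            seen[h] = s;
        }
    }

    static string RandomString(Random rng, int length)
    {
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = (char)('a' + rng.Next(26));
        return new string(chars);
    }
}
```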

From the horse's mouth:

The default implementation of the GetHashCode method does not guarantee unique return values for different objects.

Additionally, the rules for what happens when you call GetHashCode() can and will change over time. See the section titled "Rule: Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains" here, specifically:

This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don't store string hashes in databases and expect them to be the same forever, because they won't be.

To see someone's collision detection work, check this out.

Chris Haas
0

Is there a particular reason you chose 64 bits? MD5 is more for checking that the content hasn't changed by accident, and SHA is more for making sure the content wasn't changed on purpose. I'd definitely use at LEAST SHA1.
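For what it's worth, if you do go the SHA1 route and only keep 64 bits of it, the truncation itself is trivial. A rough sketch using the framework's SHA1 class:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class TruncatedSha1
{
    // Sketch: compute the full SHA1 digest and keep only its first 8 bytes.
    public static long Hash64(string s)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(s));
            return BitConverter.ToInt64(digest, 0);   // first 64 bits of the 160-bit digest
        }
    }
}
```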

Erik Philips
  • Because it's going to serve as a row id in an SQLite database, and those are at most 64 bits. Also, the strings I will be hashing are 50 bytes on average, making even an 8-byte hash almost overkill. – Maestro Nov 09 '11 at 00:19
  • Are there security concerns around the 50 bytes? Are you going to check after the hash is created that the bytes haven't changed or send the hash over the wire securely for a remote location to verify the bytes? – Erik Philips Nov 09 '11 at 00:24
  • No, none of that. It's just used so that sqlite can index shorter values. – Maestro Nov 09 '11 at 00:25
  • I would use the link Jim Mischel recommended for your purpose. – Erik Philips Nov 09 '11 at 00:26