3

I have a problem comparing a lot of string data (CSV files). The files have a uniqueID but are not sorted, and they are quite big.

So I tried to create two dictionaries where the key is the uniqueID from the file and the value is the int returned by GetHashCode() of the string I want to watch for changes.
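Roughly something like this (a simplified sketch; the file name, and the assumption that the uniqueID is everything before the first ':', are just for illustration):

using System.Collections.Generic;
using System.IO;

// Key = uniqueID (the part before the first ':'), value = hash of the whole row.
var oldRows = new Dictionary<string, int>();
foreach (string line in File.ReadLines("old.csv")) // file name is made up
{
    string id = line.Substring(0, line.IndexOf(':'));
    oldRows[id] = line.GetHashCode();
}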

But then I ran into this. A short example:

if ("30000100153:135933:Wuchterlova:335:2:Praha:16000".GetHashCode() == 
    "30000263338:158364:Radošovická:1323:10:Praha:10000".GetHashCode())
{
    Console.WriteLine("Hmm that's strange");
}

So, is there another way to do this?

I need as small a footprint as possible (due to the memory allocation of two dictionaries for two CSV files, each containing about 3M rows). Thank you.

andleer
  • 22,388
  • 8
  • 62
  • 82
avojacek
  • 35
  • 1
  • 5
  • I think you will find this thread interesting: http://stackoverflow.com/questions/735317/hashtable-dictionary-collisions – Luc Morin Jan 21 '14 at 17:23
  • 3
    Hash codes aren't unique. They simply can't be, because there are more possible string values, even in strings of `Length` 3 ((2^16)^3=2^48) than possible hashes (2^32). – Tim S. Jan 21 '14 at 17:30
  • The `GetHashCode` implementations are optimized for speed, not uniqueness. If you want to minimize the risk of collisions, use a cryptographic function (such as SHA) instead. – Douglas Jan 21 '14 at 17:31
  • Are the first 11 chars the `Unique ID`? Is it always a number? Possibly you could just parse that into an `int` for comparison. – Wagner DosAnjos Jan 21 '14 at 17:35

2 Answers

18

First of all, the documentation for string.GetHashCode specifically says to not use string hash codes for any application where they need to be stable over time, because they are not. You should be using string hash codes for one purpose only, and that is to put strings in a dictionary.

Second, hash codes are not unique. There are only four billion possible hash codes (because the hash code is a 32 bit integer) but obviously there are more than four billion strings, so there must be many strings that have the same hash code. A collection of only a few thousand strings has an extremely high probability of containing two strings with the same hash code. A graph of the probability is here:

http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx

So you might wonder how the dictionary works at all then, if it is using GetHashCode but there can be collisions. The answer is: when you put two things X and Y in a dictionary that have the same hash code, they go in the same "bucket". When you search for X the dictionary goes to the right bucket using the hash code, and then does the expensive equality check on each element in the bucket until it finds the right one. Since each bucket is small, this check is still fast enough most of the time.
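To illustrate (this sketch is mine, not part of the original answer): if you plug in a deliberately bad comparer that gives every key the same hash code, every entry lands in the same bucket, yet lookups still return the right values because the full equality check runs inside the bucket.

using System;
using System.Collections.Generic;

// Deliberately terrible comparer: every key gets the same hash code,
// so every entry is forced into the same bucket.
class ConstantHashComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => x == y;
    public int GetHashCode(string s) => 0;
}

class Demo
{
    static void Main()
    {
        var map = new Dictionary<string, string>(new ConstantHashComparer());
        map["30000100153"] = "Wuchterlova 335";
        map["30000263338"] = "Radošovická 1323";

        // All keys "collide", yet each lookup still finds the right value,
        // because the dictionary falls back to Equals within the bucket.
        Console.WriteLine(map["30000100153"]); // Wuchterlova 335
        Console.WriteLine(map["30000263338"]); // Radošovická 1323
    }
}

The only cost of the collisions is speed: the more entries share a bucket, the more equality checks each lookup has to do.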

I don't know how to solve your problem, but using a 32 bit hash is clearly not the right way to do it, so try something else. My suggestion would be to start using a database rather than CSV files if you have a lot of data to manage. That's what a database is for.

I have written many articles on string hashing that might interest you:

http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/

http://blogs.msdn.com/b/ericlippert/archive/2011/07/12/what-curious-property-does-this-string-have.aspx

http://blogs.msdn.com/b/ericlippert/archive/2005/10/24/do-not-use-string-hashes-for-security-purposes.aspx

http://blogs.msdn.com/b/ericlippert/archive/tags/hashing/

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
Maybe use something like a lazy-loading dictionary that loads the content of the files as needed? – BRAHIM Kamel Jan 21 '14 at 17:35
  • Well, even if he does store his strings in a Dictionary, and even if he does get hash collisions, he might still be able to get away with it. The Dictionary class should handle collisions, but maybe the performance hit would be too much for his application. – Luc Morin Jan 21 '14 at 17:36
The problem is that I need to compare the old/new file to find added/removed or changed rows; the first column contains the uniqueID. It has to be memory efficient and fast enough. – avojacek Jan 21 '14 at 17:41
  • Each file is about 1GB, and the RAM of an average computer is 2GB. – avojacek Jan 21 '14 at 17:44
  • @avojacek: RAM is irrelevant. Remember, you get 2GB of *address space* per 32 bit process regardless of how much RAM you have. Having more RAM just makes memory access *faster*, it doesn't enable you to have more address space per process. – Eric Lippert Jan 21 '14 at 17:50
  • @avojacek: This is more evidence that you are using the wrong tool for the job. A 1GB CSV file is craziness. Put that information into a database. – Eric Lippert Jan 21 '14 at 17:51
@Eric: I will put it into a database, but the database is where we want to store the data that gets updated from these CSV files. Maybe it would be faster to drop/recreate the table and import the whole CSV instead of doing incremental imports, but there are relationships between the tables, so I was trying to find a solution without dropping existing data. Anyway, thank you for the links; I will read them ASAP. And I know that 32-bit can only address 2GB; I didn't mention before that it must be a 32-bit program, so I tried to clarify that in a comment. – avojacek Jan 21 '14 at 17:57
  • After reading your links I must say I have to come up with another solution instead of string.GetHashCode(). I think a stronger hash such as SHA1 would be better, but it's slower and the footprint is larger. – avojacek Jan 21 '14 at 22:56
0

You don't actually want to use GetHashCode. You should just compare the strings directly. However, comparing each of 3M strings against each of another 3M strings is going to be difficult in any reasonable time without sorting the lists first.

My approach would be to sort each list first (how to do that depends on a number of things), read the first sorted row from each - let's call them A and B - and:

  1. if A = B then do whatever and read next A and next B and repeat
  2. if A < B do whatever and read next A and repeat
  3. if A > B do whatever and read next B and repeat

.. where 'do whatever' means do whatever is required in that situation and repeat means go back to step 1.

(This process is how mainframe computers used to merge stacks of cards and has a particular name, but I can't for the life of me remember it!)
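For illustration, here is a rough sketch of that merge loop in C# (my own addition; the file names are placeholders, and the ':' separator and "uniqueID is the first column" assumptions come from the question):

using System;
using System.IO;

class MergeCompare
{
    // uniqueID is assumed to be everything before the first ':'
    static string Id(string line)
    {
        return line.Substring(0, line.IndexOf(':'));
    }

    static void Main()
    {
        // Both files are assumed to be pre-sorted by uniqueID.
        using (var a = new StreamReader("old_sorted.csv"))
        using (var b = new StreamReader("new_sorted.csv"))
        {
            string lineA = a.ReadLine();
            string lineB = b.ReadLine();

            while (lineA != null && lineB != null)
            {
                int cmp = string.CompareOrdinal(Id(lineA), Id(lineB));

                if (cmp == 0)
                {
                    // Same ID in both files: row changed only if the text differs.
                    if (lineA != lineB)
                        Console.WriteLine("changed: " + lineB);
                    lineA = a.ReadLine();
                    lineB = b.ReadLine();
                }
                else if (cmp < 0)
                {
                    // ID exists only in the old file: row was removed.
                    Console.WriteLine("removed: " + lineA);
                    lineA = a.ReadLine();
                }
                else
                {
                    // ID exists only in the new file: row was added.
                    Console.WriteLine("added:   " + lineB);
                    lineB = b.ReadLine();
                }
            }

            // Whatever is left over in either file is removed/added respectively.
            for (; lineA != null; lineA = a.ReadLine()) Console.WriteLine("removed: " + lineA);
            for (; lineB != null; lineB = b.ReadLine()) Console.WriteLine("added:   " + lineB);
        }
    }
}

Note that this only ever holds one row from each file in memory at a time, so the footprint stays small no matter how big the files are.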

Cheers -

simon at rcl
  • 7,326
  • 1
  • 17
  • 24