3

Here's my problem (I'm programming in C):

I have some huge text files containing DNA sequences (each file has something like 65 million rows and a size of about 4~5 GB). These files contain many duplicates (I don't know how many yet, but there should be many millions of them), and I want to output a file with only the distinct values. Each string has a quality value associated with it, so if, e.g., I have 5 equal strings with different quality values, I'll keep the best one and discard the other 4.

Reducing memory requirements and improving speed as far as I can is VITAL. My idea was to create a JudyHS array, using a hash function to convert the DNA sequence string (which is 76 letters long and uses 7 possible characters) into an integer, to reduce memory usage (4 or 8 bytes instead of 76 bytes for many millions of entries would be quite an achievement). That way I could use the integer as an index and store only the best quality value for that index. The problem is that I can't find a hash function that UNIQUELY defines such a long string and produces a value that fits in an integer or even a long long!

My first idea for a hash function was something like Java's default string hash: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], but that can reach a maximum value of about 8.52*10^59... way too big. What about doing the same thing and storing the result in a double? Would the computation become a lot slower? Please note that I'd like a way to UNIQUELY define a string, avoiding collisions (or at least making them extremely rare, because I would have to access the disk on every collision, quite a costly operation...)
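For reference, a minimal C sketch of that Java-style hash (assuming unsigned 64-bit arithmetic, so the value silently wraps mod 2^64 and uniqueness is lost, which is exactly the problem):

```c
#include <stddef.h>
#include <stdint.h>

/* Java-style polynomial hash: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1].
 * Computed in a uint64_t, so it wraps mod 2^64: compact, but no longer a
 * unique encoding of the 76-character sequence. */
static uint64_t poly_hash31(const char *s, size_t n)
{
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = h * 31 + (unsigned char)s[i];
    return h;
}
```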

ROMANIA_engineer
Alex
  • Not answering your question, but hoping to solve your problem: would a [prefix tree](http://en.wikipedia.org/wiki/Trie) be an appropriate data structure to hold your data compactly? – Robᵩ May 03 '11 at 15:29
  • Thanks for the answer, but from what I understand that's practically what the Judy array is, and anyway, I've read claims that it's more space-efficient than a trie, so I want to give it a try – Alex May 03 '11 at 16:12

2 Answers

3

You have 7^76 possible DNA sequences and want to map them to 2^32 hashes without collisions? Not possible.

You need at least log2(7^76) ≈ 214 bits to do that, about 27 bytes.

If you can live with some collisions, I would recommend sticking with CRC32 or MD5 instead of reinventing the wheel.
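For instance, with zlib (a minimal sketch; the sequence literal is only a placeholder), a CRC32 per read takes a couple of lines:

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>          /* link with -lz */

int main(void)
{
    const char *seq = "ACGTACG...";            /* one 76-character read */
    uLong crc = crc32(0L, Z_NULL, 0);          /* initial CRC value */
    crc = crc32(crc, (const Bytef *)seq, (uInt)strlen(seq));
    printf("%08lx\n", crc);
    return 0;
}
```

With only 32 bits and 65 million entries you will get some collisions (birthday bound); that is the trade-off this suggestion accepts.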

Gunther Piez
  • Is there any kind of algorithm that allows me to code those 7^76 possible values in those 214 bits? – Alex May 03 '11 at 16:01
  • @Alex: Without special long-integer arithmetic I would use groups of 11 values (7^11 possibilities each), which can be encoded in less than 32 bits, and use 7 of those 32-bit integers. Assuming the values in your sequences are in the range 0..6: for each group of 11 values calculate (((v0*7 + v1)*7 + v2)*7 + v3)*7 ... + v10. This will be less than 1977326743 and fit in a 32-bit integer. Calculate this for 7 groups, setting v10 in the last group to zero (see the C sketch after this comment thread). – Gunther Piez May 03 '11 at 22:16
  • But I would rather use a simple hash function and a sufficiently long key. As Thomas wrote, just using a 64-bit hash will make collisions very unlikely. Look up the birthday problem (http://en.wikipedia.org/wiki/Birthday_problem): the probability of a collision across the whole table is less than 1/1000 in your case. – Gunther Piez May 03 '11 at 22:26
  • The problem with collisions is that there's the risk of discarding some values that shouldn't be discarded, and I'm not allowed to discard even one string for the sake of efficiency. I think I will implement your idea of 7 groups and add the option to also use the more efficient (but slightly unsafe) suggestion of Thomas. Thank you both! – Alex May 04 '11 at 08:39
  • Note that the probability of a collision is not 1/1000 per access to a string but rather for the whole table of 65M entries. Or to put it the other way around: if you had 1000 tables of 5 GByte each, making 5 TByte of data, most likely there would be a collision in only one of those tables; the other 999 would be collision-free. – Gunther Piez May 04 '11 at 10:10
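A minimal sketch of the base-7 packing described in the comments above, assuming the 7 letters have already been mapped to values 0..6 (names are illustrative):

```c
#include <stdint.h>

/* Pack a 76-symbol read (each symbol already mapped to 0..6) into
 * 7 x 32-bit words, 11 base-7 digits per word.  The 77th digit is
 * padding (zero).  Each word is at most 7^11 - 1 = 1,977,326,742,
 * so it fits comfortably in a uint32_t. */
static void pack_read(const uint8_t v[76], uint32_t out[7])
{
    for (int g = 0; g < 7; g++) {
        uint32_t x = 0;
        for (int i = 0; i < 11; i++) {
            int pos = g * 11 + i;
            uint8_t d = (pos < 76) ? v[pos] : 0;   /* pad the last group */
            x = x * 7 + d;
        }
        out[g] = x;
    }
}
```

The 7 words together encode the sequence exactly (no collisions), at the cost of 28 bytes per read instead of a single integer key.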
1

The "simple" way to get a collision-free hash function for N elements is to use a good mixing function (say, a cryptographic hash function) and to truncate the size, so that the hash results live in a space of size at least N2. Here, you have 65 million rows -- this fits on 26 bits (226 is close to 65 millions) so 52 bits "ought to be enough".

You can try using a fast cryptographic hash function, even a "broken" one since this is not a security-related problem. MD4, MD5, SHA-1... then truncate the result to the first (or last) 64 bits, store that in a 64-bit integer type. Chances are that you will not get any collision among your 65 million rows; and if you get some, they will be very rare.

For optimized C implementations of hash functions, look up sphlib. Use the provided sph_dec64le() function to "decode" a sequence of 8 bytes into a 64-bit unsigned integer value.
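As a sketch of that truncation, using OpenSSL's MD5() here simply because it is widely available (sphlib would work the same way; names are illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>   /* link with -lcrypto */

/* Hash one read and keep the first 8 bytes of the digest as a 64-bit key. */
static uint64_t read_key64(const char *seq, size_t len)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5((const unsigned char *)seq, len, digest);

    uint64_t key;
    memcpy(&key, digest, sizeof key);   /* first 64 bits of the digest */
    return key;
}
```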

Thomas Pornin
  • The problem is that every time I get a collision I need to determine whether it's caused by a duplicate (and so discard it) or by a different value (and save it), without having the original string stored in the hash table. If there were some kind of algorithm that allowed me to do that (but I can't imagine how), the number of collisions would be almost irrelevant... – Alex May 03 '11 at 15:58