
The problem is to find all sequences of length k in a given DNA sequence that occur more than once. I found an approach using a rolling hash function: for each sequence of length k, a hash is computed and stored in a map. To check whether the current sequence is a repetition, we compute its hash and check whether that hash already exists in the hash map. If it does, we include this sequence in our result; otherwise we add it to the hash map.

Rolling hash here means that when moving on to the next sequence by sliding the window by one, we reuse the hash of the previous sequence: we remove the contribution of the first character of the previous sequence and add the contribution of the newly added character, i.e. the last character of the new sequence.

Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
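For reference, the basic sliding-window algorithm can be sketched in Python (an assumed language, since the question names none). Here the window substrings themselves are stored in a set, which sidesteps the collision question entirely because the language's hash map checks keys for equality:

```python
def repeated_sequences(dna, k):
    # Store each length-k window in a set; membership tests use the
    # built-in string hashing plus an equality check, so a hash
    # collision can never cause a wrong answer.
    seen, repeats = set(), set()
    for i in range(len(dna) - k + 1):
        window = dna[i:i + k]
        if window in seen:
            repeats.add(window)
        seen.add(window)
    return repeats

# repeated_sequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10)
# → {'AAAAACCCCC', 'CCCCCAAAAA'}
```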

This algorithm looks perfect, but I can't work out how to build a perfect hash function so that collisions are avoided. It would be a great help if somebody could explain how to make a perfect hash under any circumstances, and most importantly in this case.

Jhanak Didwania
    Why do the standard ways of collision resolution (chaining or probing) not work for this? When using those methods, you also store the original key value so you can check it for equality – c2huc2hu Jul 18 '18 at 17:36

4 Answers


This is actually a research problem.

Let's come to terms with some facts. Let the input be N and its length |N|.

  1. You have to move a sliding window of size k (here k = 10) over the input. Therefore you must live with O(|N|) time or more.
  2. Your rolling hash is a form of locality-sensitive deterministic hashing. The downside of deterministic hashing is that its benefit is greatly diminished: the more often you encounter similar strings, the harder they are to hash apart.
  3. The longer your input, the less effective hashing will be.

Given these facts, "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.

So what alternatives do you have?

  1. Bloom filters. They are much more robust than simple hashing. The downside is that they sometimes produce false positives, but this can be mitigated by using several filters.
  2. Cuckoo hashes: similar to Bloom filters, but they use less memory, offer locality-sensitive "hashing", and have worst-case constant lookup time.
  3. Just stick every suffix in a suffix trie. Once this is done, output every string at depth 10 that has at least 2 children, with one of the children being a leaf.
  4. Improve on the suffix trie with a suffix tree. Lookup is not as straightforward, but memory consumption is lower.
  5. My favorite: the FM-index. In my opinion the cleanest solution; it uses the Burrows-Wheeler transform. This technique is also used in industry tools like Bowtie and BWA.
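Option 3 can be sketched in Python (an assumed language, since the answer names none). For brevity this simplified version inserts only the length-k windows into a dict-based trie, rather than every suffix, and counts arrivals at depth k:

```python
def repeated_kmers_trie(s, k):
    root = {}        # each trie node is a dict mapping base -> child node
    repeats = set()
    for i in range(len(s) - k + 1):
        node = root
        for ch in s[i:i + k]:
            node = node.setdefault(ch, {})   # descend, creating nodes as needed
        node['#'] = node.get('#', 0) + 1     # occurrence count at depth k
        if node['#'] >= 2:
            repeats.add(s[i:i + k])
    return repeats
```

This runs in O(|N| * k) time and space, which is the price of the explicit trie; the suffix-tree and FM-index options above improve on that.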
Srini

Heads-up: This is not a general solution, but a good trick that you can use when k is not large.

The trick is to encode the sequence as an integer using bit manipulation.

If your input k is relatively small, say around 10, then you can encode your DNA sequence in an int via bit manipulation. Since each character in the sequence has only 4 possibilities, A, C, G, T, you can simply define your own mapping that uses 2 bits per letter.

For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.

In this way, if k is 10, you won't need a string of 10 characters as the hash key. Instead, you need only 20 bits of an integer to represent the key string.

Then, when you roll the hash, you left-shift the integer that stores the previous sequence by 2 bits, and use a bit operation such as |= to set the last two bits from the new character. Remember to clear the 2 bits that were shifted out past the window (for example by masking with 2k one-bits), which removes the oldest character from the sliding window.

By doing this, a string can be stored in an integer, and using that integer as the hash key is cheaper in terms of hash computation. If k is larger than 16 (so 2k bits no longer fit in a 32-bit int), you may be able to use a long value, which covers k up to 32. Beyond that, you might use a bitset or a bit array, but hashing those becomes another issue.

Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.
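The shift-and-mask update above can be sketched in Python (an assumed language; in Python, masking with 2k one-bits plays the role of clearing the shifted-out bits, since its integers are arbitrary precision):

```python
def repeated_kmers_bits(s, k):
    # 2-bit encoding per base; for a fixed-width language, 2*k must fit
    # in the integer type (k <= 16 for a 32-bit int, k <= 32 for 64-bit).
    enc = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
    mask = (1 << (2 * k)) - 1            # keeps only the low 2k bits of the window
    code = 0
    seen, repeats = set(), set()
    for i, ch in enumerate(s):
        code = ((code << 2) | enc[ch]) & mask   # shift in new base, drop the oldest
        if i >= k - 1:                          # window is full
            if code in seen:
                repeats.add(s[i - k + 1:i + 1])
            seen.add(code)
    return repeats
```

Because the 2k-bit code identifies the window uniquely, this is an exact encoding rather than a lossy hash, so no collision handling is needed.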

hiimdaosui
  • I think, this solution will work. For a sequence 5A5C -> hash can be calculated as (10^9)*0+(10^8)*0+(10^7)*0+(10^6)*0+(10^5)*0+(10^4)*2+(10^3)*2+(10^2)*2+(10^1)*2+(10^0)*2 – Jhanak Didwania Jul 19 '18 at 06:48
  • @JhanakDidwania I'm not sure if you are discussing my solution since my idea is to store the string in an integer and use that integer as the hash key in a hash map. I don't think you need to come up with your own hash function here since even you tried your hash function, collision is basically unavoidable. So just use APIs in your programming languages. – hiimdaosui Jul 19 '18 at 09:57

You can build the suffix array and the LCP array. Iterate through the LCP array; every time you see a value greater than or equal to k, report the length-k prefix of the suffix at that position (using the suffix array to determine where the substring comes from).

After you report a substring because the LCP was greater than or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).

Both the suffix array and the LCP array can be constructed in linear time, so overall the solution is linear in the size of the input plus output.
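A Python sketch of this answer, using a simple suffix sort for brevity (a linear-time construction such as SA-IS would be much longer) and Kasai's algorithm for the LCP array; a set handles the deduplication described above:

```python
def repeated_kmers_sa(s, k):
    n = len(s)
    # Suffix array via direct sorting of suffixes -- simple but not linear time.
    sa = sorted(range(n), key=lambda i: s[i:])
    # Kasai's algorithm: lcp[r] = longest common prefix of suffixes sa[r-1], sa[r].
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1          # LCP can drop by at most 1 for the next suffix
        else:
            h = 0
    # Adjacent suffixes sharing >= k characters give a repeated k-mer.
    return {s[sa[r]:sa[r] + k] for r in range(1, n) if lcp[r] >= k}
```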


What you could do is use the Chinese remainder theorem and pick several large prime moduli. Recall that CRT says a system of congruences with coprime moduli has a unique solution modulo the product of the moduli. So if you have three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size roughly 10^18. With a sufficiently large modulus, you can more or less disregard the possibility of a collision happening at all. As my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than that a collision happens, since you can drive the collision probability to be arbitrarily small.
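A sketch of this idea in Python (an assumed language), pairing two of the suggested moduli in a rolling hash; a window is only reported as a repeat when both hashes have been seen together, so a false positive would require a simultaneous collision under both moduli. For absolute certainty one would additionally compare the substrings themselves on a pair match:

```python
def repeated_kmers_crt(s, k):
    mods = (10**6 + 3, 10**6 + 33)        # two of the large coprime moduli suggested above
    base = 4
    enc = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    pk = [pow(base, k, m) for m in mods]  # base^k mod m, for removing the outgoing char
    h = [0, 0]
    seen, repeats = set(), set()
    for i, ch in enumerate(s):
        for j, m in enumerate(mods):
            h[j] = (h[j] * base + enc[ch]) % m        # add the incoming character
            if i >= k:
                h[j] = (h[j] - enc[s[i - k]] * pk[j]) % m   # drop the outgoing one
        if i >= k - 1:                     # window s[i-k+1 .. i] is complete
            key = tuple(h)                 # collision only if BOTH hashes collide at once
            if key in seen:
                repeats.add(s[i - k + 1:i + 1])
            seen.add(key)
    return repeats
```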