If your memory problems lie in creating the suffix tree, are you sure you need one? You could find all matches in a single string like this:
word = get_string(4**12) + "$"   # get_string is assumed to return a random string of the given length

def matcher(word, match_string):
    # repeatedly call str.find, starting just past the previous hit,
    # so overlapping occurrences are found too
    positions = [-1]
    while 1:
        positions.append(word.find(match_string, positions[-1] + 1))
        if positions[-1] == -1:
            return positions[1:-1]

print matcher(word, 'AAAAAAAAAAAA')
[13331731, 13331732, 13331733]
print matcher('AACTATAAATTTACCA','AT')
[4, 8]
My machine is pretty old, and this took 30 secs to run on the 4^12-length string. I used a 12-character target so there would be some matches. Also, this solution will find overlapping matches, should there be any.
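For instance, a quick check of the overlap behaviour, using the matcher function above on a made-up string:

print matcher('AAAA', 'AA')   # overlapping occurrences of 'AA' start at 0, 1 and 2
[0, 1, 2]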
Here is a suffix tree module you could try; use it like this:
import suffixtree
stree = suffixtree.SuffixTree(word)
print stree.find_substring("AAAAAAAAAAAA")
Unfortunately, my machine is too slow to test this out properly with long strings. But presumably once the suffix tree is built the searches will be very fast, so for large numbers of searches it should be a good call. Further, find_substring only returns the first match (I don't know if this is an issue; I'm sure you could adapt it easily).
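If you do need every position rather than just the first, one simple adaptation (just a sketch with a hypothetical helper, find_all, assuming find_substring returns -1 when nothing is found, and reusing the matcher function from above) is to use the tree as a fast existence test and only fall back to the linear scan when a hit is confirmed:

def find_all(stree, word, search):
    # hypothetical helper: the suffix tree gives a quick yes/no answer...
    if stree.find_substring(search) == -1:
        return []
    # ...and the plain linear scan enumerates every position only when needed
    return matcher(word, search)

If most of your searches come back empty, this keeps the average cost close to a single tree lookup.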
Update: Split the string into smaller suffix trees, thus avoiding memory problems
So if you need to do 10 million searches on a 4^12-length string, we clearly do not want to wait 9.5 years (10 million x 30 sec with the standard simple search I first suggested, on my slow machine...). However, we can still use suffix trees (and thus be a lot quicker), AND avoid the memory issues. Split the large string into manageable chunks (which we know the machine's memory can cope with), turn a chunk into a suffix tree, search it 10 million times, then discard that chunk and move on to the next one. We also need to remember to search the overlap between chunks, in case a match straddles a chunk boundary. I wrote some code to do this (it assumes the length of the large string to be searched, word, is a multiple of our maximum manageable string length, max_length; if that is not the case you'll have to adjust the code to also check the remainder at the end, as sketched after the code below):
def split_find(word, search_words, max_length):
    number_sub_trees = len(word) / max_length
    matches = {}
    for i in xrange(0, number_sub_trees):
        # build a suffix tree for this chunk only, so memory use stays bounded
        stree = suffixtree.SuffixTree(word[max_length*i:max_length*(i+1)])
        for search in search_words:
            if search not in matches:
                match = stree.find_substring(search)
                if match > -1:
                    matches[search] = match + max_length*i, i
                # also check the region straddling the boundary with the next chunk
                if i < number_sub_trees - 1:
                    boundary = max_length*(i+1)
                    match = word[boundary - len(search):boundary + len(search)].find(search)
                    if match > -1:
                        matches[search] = match + boundary - len(search), i
    return matches
word = get_string(4**12)
search_words = ['AAAAAAAAAAAAAAAA']  # list of all words to find matches for
max_length = 4**10  # as large as your machine can cope with (must divide len(word))
print split_find(word, search_words, max_length)
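If len(word) is not an exact multiple of max_length, one minimal way to cover the leftover tail (a rough sketch with a hypothetical helper, find_in_remainder; it uses a plain str.find on the short remainder instead of building another tree, and backs up slightly before the last chunk boundary so straddling matches are caught):

def find_in_remainder(word, search_words, max_length, matches):
    # hypothetical helper, meant to run after split_find
    tail_start = (len(word) // max_length) * max_length  # start of the leftover tail
    for search in search_words:
        if search not in matches:
            # back up by len(search) - 1 so matches straddling the boundary are found
            start = max(tail_start - len(search) + 1, 0)
            match = word[start:].find(search)
            if match > -1:
                matches[search] = match + start  # absolute position only (no chunk index here)
    return matches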
In this example I limit the maximum suffix tree length to 4^10, which needs about 700MB.
Using this code, for one 4^12-length string, 10 million searches should take around 13 hours (these are worst-case full searches with zero matches, so if there are matches it will be quicker). However, as part of this we need to build 4^12/4^10 = 16 suffix trees, which will take around 16 x 41 sec, roughly 11 minutes.
So the total time to run is around 13 hours, without memory issues... a big improvement on 9.5 years.
Note that I am running this on a 1.6GHz CPU with 1GB RAM, so you ought to be able to do way better than this!