I'm running into a bit of trouble with some code. Please bear in mind that I'm a terrible programmer, so my solution probably isn't very elegant (and is likely the reason I'm running out of memory; I have 4 gigabytes of RAM, and the script slowly fills it).
Here's the problem. I've got about 3,500 files in a directory. Each file consists of a single line of characters with no spaces, and the lines vary widely in length (the smallest file is 200 bytes; the largest is 1.3 megabytes). What I'm trying to do is find common substrings of a set length (13 characters in the code below) between these files, two at a time. I compare them two at a time because I'm not looking for a substring common to all of the files; I want to check every pair of files for any shared substring of the set length.
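To make that concrete: two strings share a substring of set length k exactly when they share at least one length-k window. Here's a naive standalone check, just to illustrate the goal (the shares_substring name and the toy strings are mine, not from my actual script):

def shares_substring(s1, s2, k=13):
    # collect every length-k window of s1, then test each window of s2 against them
    grams = set(s1[i:i + k] for i in xrange(len(s1) - k + 1))
    return any(s2[i:i + k] in grams for i in xrange(len(s2) - k + 1))

print shares_substring("abcdefghijklmnop", "zzzabcdefghijklmzzz")  # True: both contain "abcdefghijklm"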
I use a suffix tree module that wraps a C implementation (over here). First I make a list of all the files in the directory. Then I iterate over all combinations of two files so that every pair is covered, pass each pair to the suffix tree, and look for sequences that are shared substrings.
However, I don't understand why the script slowly runs out of memory. I hope the code can be amended so that it releases memory it no longer needs. Obviously 3,500 files will take a long time to process, but I hope it's possible to do so without incrementally filling 4 gigabytes of memory. Any help would be greatly appreciated! Here is the code I've got so far:
from suffix_tree import GeneralisedSuffixTree
from itertools import combinations
import glob, hashlib, os

alist = open('tmp.adjlist', 'w')

def read_f(f):
    # each file is a single line; read() avoids the "['...']" wrapper
    # that str(readlines()) would add around the contents
    fh = open(f, "r")
    s = fh.read().strip()
    fh.close()
    return s

def read_gs(a, b):
    s1 = read_f(a)
    s2 = read_f(b)
    print str(a) + ":" + str(hashlib.md5(s1).hexdigest()) + " --- " + str(b) + ":" + str(hashlib.md5(s2).hexdigest())
    return [s1, s2]

def build_tree(s):
    hlist = []
    stree = GeneralisedSuffixTree(s)
    # record the md5 of every sequence that takes part in a shared
    # substring of length >= 13
    for shared in stree.sharedSubstrings(13):
        for seq, start, stop in shared:
            hlist.append(hashlib.md5(stree.sequences[seq]).hexdigest())
    hlist = list(set(hlist))  # dedupe
    for h in hlist:
        alist.write(str(h) + " ")
    alist.write('\n')

glist = []
for g in glob.glob("*.g"):
    glist.append(g)

for a, b in list(combinations(glist, 2)):  # list() materializes all ~6.1M pairs up front
    s = read_gs(a, b)
    build_tree(s)

alist.close()
os.system("uniq tmp.adjlist network.adjlist && rm tmp.adjlist")
UPDATE #1
Here's the updated code. I added the suggestions Pyrce made. However, after jogojapan identified the memory leak in the C code, and given that fixing it is way outside my expertise, I ended up going with a much slower approach. If anyone is knowledgeable in this area, I'd be really curious to see how to modify the C code to fix the memory leak or the deallocation function, as I think a C suffix-tree binding for Python is very valuable. It will probably take a few days to run the data through this script without a suffix tree, so I'm definitely open to creative fixes (I sketch one idea I'm considering after the code below)!
from itertools import combinations
import glob, hashlib, os

def read_f(f):
    # each file is a single line; read() avoids the "['...']" wrapper
    # that str(readlines()) would add around the contents
    with open(f, "r") as openf:
        return openf.read().strip()

def read_gs(a, b):
    s1 = read_f(a)
    s2 = read_f(b)
    print str(a) + ":" + str(hashlib.md5(s1).hexdigest()) + " --- " + str(b) + ":" + str(hashlib.md5(s2).hexdigest())
    return [s1, s2]

def lcs(S1, S2):
    # classic dynamic-programming longest common substring:
    # M[x][y] holds the length of the common suffix of S1[:x] and S2[:y]
    # (note: the full matrix needs O(len(S1)*len(S2)) memory; keeping
    # only the previous row would suffice)
    M = [[0] * (1 + len(S2)) for i in xrange(1 + len(S1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(S1)):
        for y in xrange(1, 1 + len(S2)):
            if S1[x - 1] == S2[y - 1]:
                M[x][y] = M[x - 1][y - 1] + 1
                if M[x][y] > longest:
                    longest = M[x][y]
                    x_longest = x
            else:
                M[x][y] = 0
    return S1[x_longest - longest: x_longest]

glist = glob.glob("*.g")
for a, b in combinations(glist, 2):
    s = read_gs(a, b)
    p = lcs(s[0], s[1])
    if len(p) >= 13:  # implies p != ""
        with open("tmp.adjlist", "a") as openf:
            openf.write(hashlib.md5(s[1]).hexdigest() + " " + hashlib.md5(s[0]).hexdigest() + "\n")

# note: uniq only collapses adjacent duplicate lines; sort -u would
# dedupe the whole file
os.system("uniq tmp.adjlist network.adjlist && rm tmp.adjlist")
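And here's the rough sketch of the idea I mentioned above (untested on the full data set; kgrams is a hypothetical helper of mine, and it reuses read_f, glist, and the imports from the script above): instead of computing the full longest common substring, build the set of all 13-character windows of each string and check whether the two sets intersect. The sets are rebuilt and discarded on every pair, so nothing should accumulate across iterations:

def kgrams(s, k=13):
    # all length-k windows of s
    return set(s[i:i + k] for i in xrange(len(s) - k + 1))

for a, b in combinations(glist, 2):
    s1, s2 = read_f(a), read_f(b)
    if kgrams(s1) & kgrams(s2):  # any shared 13-character substring?
        with open("tmp.adjlist", "a") as openf:
            openf.write(hashlib.md5(s2).hexdigest() + " " + hashlib.md5(s1).hexdigest() + "\n")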