I've got about 3,500 files that consist of single line character strings. The files vary in size (from about 200b to 1mb). I'm trying to compare each file with each other file and find a common subsequence of length 20 characters between two files. Note that the subsequence is only common between two files during each comparison, and not common among all files.
I've stuggled with this problem a bit, and since I'm not an expert, I've ended up with a bit of an ad-hoc solution. I use itertools.combinations to build a list in Python that ends up with around 6,239,278 combinations. I then pass the files two at a time to a Perl script that acts a wrapper for a suffix tree library written in C called libstree. I've tried to avoid this type of solution but the only comparable C suffix tree wrapper in Python suffers from a memory leak.
So here's my problem. I've timed it, and on my machine, the solution processes about 500 comparisons in 25 seconds. So that means, it'll take around 3 days of continuous processing to complete the task. And then I have to do it all again to look at say 25 characters instead of 20. Please note that I'm way out of my comfort zone and not a very good programmer, so I'm sure there is a much more elegant way to do this. I thought I'd ask it here and produce my code to see if anyone has any suggestion as to how I could complete this task faster.
Python code:
from itertools import combinations
import glob, subprocess
glist = glob.glob("Data/*.g")
i = 0
for a,b in combinations(glist, 2):
i += 1
p = subprocess.Popen(["perl", "suffix_tree.pl", a, b, "20"], shell=False, stdout=subprocess.PIPE)
p = p.stdout.read()
a = a.split("/")
b = b.split("/")
a = a[1].split(".")
b = b[1].split(".")
print str(i) + ":" + str(a[0]) + " --- " + str(b[0])
if p != "" and len(p) == 20:
with open("tmp.list", "a") as openf:
openf.write(a[0] + " " + b[0] + "\n")
Perl code:
use strict;
use Tree::Suffix;
open FILE, "<$ARGV[0]";
my $a = do { local $/; <FILE> };
open FILE, "<$ARGV[1]";
my $b = do { local $/; <FILE> };
my @g = ($a,$b);
my $st = Tree::Suffix->new(@g);
my ($c) = $st->lcs($ARGV[2],-1);
print "$c";