How to speed up Python string matching code

Question

I have this code, which computes the longest common subsequence (LCS) between random strings to see how accurately one can reconstruct an unknown region of the input. To get good statistics I need to iterate it many times, but my current Python implementation is far too slow. Even using PyPy, a single run currently takes 21 seconds, and I would ideally like to run it hundreds of times.

#!/usr/bin/python

import random
import itertools
#test to see how many different unknowns are compatible with a set of LCS answers.
def lcs(x, y):
    n = len(x)
    m = len(y)
    # table is the dynamic programming table: (n+1) rows by (m+1) columns
    table = [[0] * (m+1) for _ in xrange(n+1)]
    for i in range(n+1):     # i=0,1,...,n
        for j in range(m+1):  # j=0,1,...,m
            if i == 0 or j == 0:
                table[i][j] = 0
            elif x[i-1] == y[j-1]:
                table[i][j] = table[i-1][j-1] + 1
            else:
                table[i][j] = max(table[i-1][j], table[i][j-1])

    # Now, table[n][m] is the length of the LCS of x and y.
    return table[n][m]

def lcses(pattern, text):
    return [lcs(pattern, text[i:i+2*l]) for i in xrange(0,l)]

l = 15
#Create the pattern
pattern = [random.choice('01') for i in xrange(2*l)]

#create text start and end and unknown. 
start = [random.choice('01') for i in xrange(l)]
end = [random.choice('01') for i in xrange(l)]
unknown = [random.choice('01') for i in xrange(l)]

lcslist= lcses(pattern, start+unknown+end)

count = 0
for test in itertools.product('01',repeat = l):
    test=list(test)
    testlist = lcses(pattern, start+test+end)
    if (testlist == lcslist):
        count += 1

print count

I tried converting it to NumPy, but I must have done it badly, as it actually ran more slowly. Can this code be sped up a lot somehow?

Update: following a comment below, it would be better if lcses used a recurrence that directly gave the LCS between pattern and every sublist of text of the same length. Is it possible to modify the classic dynamic programming LCS algorithm to do this?

Use a C version and wrap it with ctypes or cython. http://rosettacode.org/wiki/Longest_common_subsequence#C — michaelmeyer, Jun 29 '13 at 15:37
@doukremt That sounds like a good idea although I have never done anything like that. Would you mind showing me how this would work in my case? — , Jun 29 '13 at 17:42
Regarding your update above: we don't write code for people on StackOverflow. Try implementing it and then open a new question if you can't get it right. — msw, Jun 30 '13 at 11:57
Well, whatever you do, [don't search StackOverflow for related answers](http://stackoverflow.com/questions/tagged/python+lcs?sort=frequent)! — msw, Jun 30 '13 at 14:44
@msw I know the recurrence for LCS and it's in my question. The problem is a sliding window version which is much less obvious. — , Jun 30 '13 at 16:11

msw · Answer 1 · 2013-06-29T17:04:01.710

The recurrence table, table, is recomputed 15 times on every call to lcses(), even though it depends only on m and n, where m is at most 2*l and n is at most 3*l.

If your program computed table only once, it would be dynamic programming, which it currently is not. A Python idiom for this would be:

table = None
def use_lcs_table(m, n, l):
    # Sketch only: assumes lcs() is changed to build and return the full table once.
    global table
    if table is None:
        table = lcs(2*l, 3*l)
    return table[m][n]

Using a class instance would be cleaner and more extensible than a global table declaration, but this gives you an idea of why it's taking so much time.

Added in reply to comment:

Dynamic programming is an optimization that trades extra space for less time. In your example you appear to be doing a table pre-computation in lcs(), but you build the whole list on every single call and then throw it away. I don't claim to understand the algorithm you are trying to implement, but the way you have it coded, it either:

Has no recurrence relation, thus no grounds for DP optimization, or
Has a recurrence relation, the implementation of which you bungled.
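If the full table is not needed afterwards, one way to avoid building and discarding an (n+1) by (m+1) list on every call is to keep only the previous row of the recurrence. This is a sketch under that assumption, not the asker's code. The name lcs_length is hypothetical; it should return the same length as lcs() in the question, and the running time stays proportional to n*m:

```python
def lcs_length(x, y):
    """Length of the LCS of x and y, keeping only two rows of the DP table."""
    # prev[j] holds the LCS length of the x prefix seen so far and y[:j].
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]  # column 0: LCS with an empty prefix of y is 0
        for j, yj in enumerate(y, 1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

It accepts strings or lists, so it can be dropped into lcses() in place of lcs() without other changes.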

'table' depends on the list 'pattern' and the particular sublist of 'text' it is being called with. I am not sure what you meant. — , Jun 29 '13 at 16:06
You are right that a recurrence that found the LCS between pattern and every sublist of text of the same length directly would be better. Question edited. — , Jun 30 '13 at 10:37
