I was under the impression that the following function runs in O(mn) time, where m and n are the lengths of the two strings. However, someone disagreed, pointing out that string concatenation copies characters, so for long strings the function is no longer O(mn). That sounds reasonable. What would a bottom-up Python implementation look like that does not concatenate strings? Below is my implementation, which does use string concatenation.

def lcs_dynamic_programming(s1, s2):
    # matrix[i][j] holds an LCS of s1[:i+1] and s2[:j+1] as a string
    matrix = [["" for x in range(len(s2))] for x in range(len(s1))]
    print(matrix)
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] == s2[j]:
                if i == 0 or j == 0:
                    matrix[i][j] = s1[i]
                else:
                    # string concatenation: copies the previous LCS into a new string
                    matrix[i][j] = matrix[i-1][j-1] + s1[i]
            else:
                matrix[i][j] = max(matrix[i-1][j], matrix[i][j-1], key=len)

    cs = matrix[-1][-1]

    return len(cs), cs

Edit: For example, consider s1 = "thisisatest" and s2 = "testing123testing", whose LCS is "tsitest".

Matt

1 Answer

Each string concatenation costs O(k) in the length k of the result, because Python strings are immutable and every concatenation copies the characters into a new string. To avoid string concatenation, you can store a list of characters in each matrix cell instead of a string, and join the final list of characters into a string only upon return:

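def lcs_dynamic_programming(s1, s2):
    # matrix[i][j] holds an LCS of s1[:i+1] and s2[:j+1] as a list of characters
    matrix = [[[] for x in range(len(s2))] for x in range(len(s1))]
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] == s2[j]:
                if i == 0 or j == 0:
                    matrix[i][j] = [s1[i]]
                else:
                    matrix[i][j] = matrix[i - 1][j - 1] + [s1[i]]
            else:
                # For i == 0 or j == 0 the index -1 wraps around to a cell that is
                # still empty, so the empty list acts as the base case.
                matrix[i][j] = max(matrix[i - 1][j], matrix[i][j - 1], key=len)
    cs = matrix[-1][-1]
    return len(cs), ''.join(cs)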
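
One caveat: `+` on lists also builds a new list, so the list version above still copies up to min(m, n) elements for each matching cell. A common way to get a genuine O(mn) bound is to store only integer LCS lengths in the table and recover the string by walking back from the last cell. Here is a minimal sketch of that approach. The name `lcs_backtrack` and the extra row and column of zeros are illustrative assumptions, not part of the question's code.

```python
def lcs_backtrack(s1, s2):
    """Return (length, one LCS) using an integer table plus a backward walk."""
    m, n = len(s1), len(s2)
    # lengths[i][j] is the LCS length of s1[:i] and s2[:j];
    # row 0 and column 0 stand for the empty prefix.
    lengths = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                lengths[i][j] = lengths[i - 1][j - 1] + 1
            else:
                lengths[i][j] = max(lengths[i - 1][j], lengths[i][j - 1])

    # Walk back from the bottom-right cell, collecting matched characters
    # in reverse order. This takes at most m + n steps.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            chars.append(s1[i - 1])
            i -= 1
            j -= 1
        elif lengths[i - 1][j] >= lengths[i][j - 1]:
            i -= 1
        else:
            j -= 1
    chars.reverse()
    return len(chars), "".join(chars)
```

Each cell now costs O(1), and the only string built is the final result. For the question's example, `lcs_backtrack("thisisatest", "testing123testing")` returns `(7, "tsitest")`. When several subsequences tie for longest, which one comes back depends on the tie-breaking in the backward walk.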
blhsing
  • your code above returns the wrong results for longer sequences – Matt Apr 27 '21 at 01:18
  • I see. Can you update your question with the long sequences that would reproduce such wrong results? – blhsing Apr 27 '21 at 01:24
  • I edited my post and included an example – Matt Apr 27 '21 at 01:33
  • @Matt Ah, my previous attempt was fundamentally flawed indeed. Rewrote it as a minor modification to your code instead. – blhsing Apr 27 '21 at 02:07
  • That is still not right, using s1="GAC", and s2="AGCAT" should produce either "AC", "GC", or "GA", but your algorithm outputs "GAC" which is incorrect – Matt Apr 27 '21 at 02:12
  • @Matt Already edited with a fix. (Thought I could get away with not having the `if i == 0 or j == 0:` condition.) – blhsing Apr 27 '21 at 02:13
  • This works, many thanks. It runs in O(mn); space-wise it could probably be optimized further, but that goes beyond my question. – Matt Apr 27 '21 at 02:17
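
The space optimization mentioned in the last comment can be sketched for the length alone: each row of the length table depends only on the row before it, so two rows are enough. This is an illustrative sketch under that assumption. The name `lcs_length` is hypothetical, and the function returns only the length. Recovering the subsequence itself in reduced space needs a more involved technique such as Hirschberg's algorithm.

```python
def lcs_length(s1, s2):
    """Return the LCS length using two rows instead of a full table."""
    # Keep the shorter string along the row so the rows stay small.
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    prev = [0] * (len(s2) + 1)  # row for the empty prefix of s1
    for ch in s1:
        curr = [0]  # column 0: empty prefix of s2
        for j, other in enumerate(s2, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Time stays O(mn), while extra space drops to O(min(m, n)).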