I am attempting to find the LCS of two DNA sequences. I output the matrix as well as the string that holds the longest common sequence. However, when I return both the matrix and the string from my code, I get the following error: IndexError: string index out of range

If I remove the code that involves the variables temp and highestcount, my code outputs the matrix correctly. I am trying to build the string with code similar to what I use for the matrix. Is there a way to avoid this error? Based on the sequences AGCTGGTCAG and TACGCTGGTGGCAT, the longest common sequence should be GCTGGT.

def lcs(x,y):
    c = len(x)
    d = len(y)
    plot = []
    temp = ''
    highestcount = ''

    for i in range(c):
        plot.append([])
        temp.join('')
        for j in range(d):
            if x[i] == y[j]:
                plot[i].append(plot[i-1][j-1] + 1)
                temp.join(temp[i-1][j-1])
            else:
                plot[i].append(0)
                temp = ''
                if temp > highestcount:
                    highestcount = temp

    return plot, temp

x = "AGCTGGTCAG"
y = "TACGCTGGTGGCAT"
test = lcs(x,y)

print test
Roy

3 Answers

The first time temp.join(temp[i-1][j-1]) runs, temp is still the empty string, ''.

An empty string has no characters to look up by index, so temp[any_number] raises IndexError: string index out of range.
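A minimal sketch of the failure in isolation, assuming temp starts out as '' exactly as in the question (the variable name message is only for illustration):

```python
temp = ''

# Indexing an empty string fails for every index, including negative ones.
try:
    temp[-1]
except IndexError as err:
    message = str(err)

print(message)
```

The second index in temp[i-1][j-1] is never reached, because the first lookup already raises.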

Ian Price
  • But I'm starting off with nothing in the string because I want the program to fill in the longest sequence as it progresses through the for loop (and of course replace the sequence already saved with any longer one it finds) – Roy Sep 29 '14 at 01:25
  • Python evaluates the innermost expression first, which here is temp[i-1][j-1]. At that point temp == '', so there is no character for any index to find (`any_string[any_integer]` returns the single character at that position, e.g. 'Hello'[1] is 'e'). I have no background in genetics, so I'm not sure what you want to join to the existing temp string, but looking up characters by index in an empty string will never work. Can you provide an example of the output you'd like to see? – Ian Price Sep 29 '14 at 01:33
  • See this page, http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Python – han058 Sep 29 '14 at 01:35

As far as I know, join() joins a list of strings, using the string it is called on as the separator. For example, "-".join(["a", "b", "c"]) returns "a-b-c".

Furthermore, you defined temp as a string but later refer to it with a double index, as if it were a two-dimensional array. A string takes a single index, which returns one character. For example, with a = "foobar", a[3] returns "b".
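The two points above can be checked with a short sketch (the variable names here are illustrative, not taken from the question):

```python
parts = ["a", "b", "c"]
joined = "-".join(parts)   # the separator is placed between the items

a = "foobar"
fourth = a[3]              # a single index returns one character

# Strings are immutable: join() and + return new strings,
# so the result has to be assigned back to the name.
temp = ""
temp.join("G")             # result discarded, temp is unchanged
unchanged = temp
temp = temp + "G"
temp += "C"

print(joined, fourth, repr(unchanged), temp)
```

This is why calling temp.join(...) without assigning the result has no effect on temp in the original code.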

I altered your code to the following, initializing both arrays up front (with one extra row and column) to avoid indexing trouble.

def lcs(x,y):
    c = len(x)
    d = len(y)
    # One extra row and column, so plot[i][j] and temp[i][j] always exist.
    plot = [[0 for j in range(d+1)] for i in range(c+1)]
    temp = [['' for j in range(d+1)] for i in range(c+1)]
    highestcount = 0
    longestWord = ''

    for i in range(c):
        for j in range(d):
            if x[i] == y[j]:
                # Extend the run that ended at the previous diagonal cell.
                plot[i+1][j+1] = plot[i][j] + 1
                temp[i+1][j+1] = ''.join([temp[i][j],x[i]])
                # Check on every match, so a run that ends at the edge
                # of the matrix is not missed.
                if plot[i+1][j+1] > highestcount:
                    highestcount = plot[i+1][j+1]
                    longestWord = temp[i+1][j+1]
            else:
                plot[i+1][j+1] = 0
                temp[i+1][j+1] = ''

    return plot, temp, highestcount, longestWord

x = "AGCTGGTCAG"
y = "TACGCTGGTGGCAT"
test = lcs(x,y)
print(test)
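As a possible variation, here is a sketch that keeps only the integer matrix and remembers where the best run ends, instead of storing a second matrix of strings. The function name lcs_by_position and the variables best_len and best_end are made up for this example; it assumes, as above, that the goal is the longest contiguous run shared by both sequences.

```python
def lcs_by_position(x, y):
    """Return (plot, longest common substring) using one integer matrix."""
    plot = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best_len = 0
    best_end = 0  # index in x just past the end of the best run

    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] == y[j]:
                plot[i + 1][j + 1] = plot[i][j] + 1
                if plot[i + 1][j + 1] > best_len:
                    best_len = plot[i + 1][j + 1]
                    best_end = i + 1
            # On a mismatch the cell keeps its initial 0.

    return plot, x[best_end - best_len:best_end]


plot, word = lcs_by_position("AGCTGGTCAG", "TACGCTGGTGGCAT")
for row in plot:
    print(" ".join(str(n) for n in row))
print(word)
```

The substring is recovered with a single slice at the end, which avoids building a string in every cell.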
timctran

It seems to me that your approach is unnecessarily elaborate, and that is leading to confusion, including the empty-string problem that others have mentioned.

For example, this is still pretty verbose, but I think it is easier to follow (and it returns the expected answer):

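def lcs(seq1, seq2):
    matches = []
    for i in range(len(seq1)):
        # Grow the slice seq1[i:j] while it still occurs somewhere in seq2.
        j = i + 1
        while seq1[i:j] in seq2:
            j += 1
            if j > len(seq1):
                break
        # seq1[i:j-1] is the longest slice starting at i that matched.
        matches.append( (len(seq1[i:j-1]), seq1[i:j-1]) )
    # Tuples compare by length first, so max() picks the longest match.
    return max(matches)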

seq1 = 'AGCTGGTCAG'
seq2 = 'TACGCTGGTGGCAT'
lcs(seq1, seq2)

returns

(6, 'GCTGGT')
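As an independent cross-check of that result, one option is the standard library's difflib.SequenceMatcher, whose find_longest_match method returns the longest matching block between two sequences. This is only a sketch for verification, not a replacement for the matrix the question asks about; the names matcher, match and longest are illustrative.

```python
from difflib import SequenceMatcher

seq1 = 'AGCTGGTCAG'
seq2 = 'TACGCTGGTGGCAT'

# autojunk=False disables the heuristic that treats very frequent
# elements as junk in long sequences.
matcher = SequenceMatcher(None, seq1, seq2, autojunk=False)
match = matcher.find_longest_match(0, len(seq1), 0, len(seq2))

# match.a is the start in seq1, match.size the length of the block.
longest = seq1[match.a:match.a + match.size]
print(match.size, longest)
```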
iayork
  • This code is good, but I wouldn't be able to output my matrix. Is there also a way to use i only for one sequence and j for the other sequence? – Roy Sep 29 '14 at 02:34
  • Can you give an example of what you want your output to look like? – iayork Sep 29 '14 at 11:08