0

I want to write the first part of the Smith-Waterman algorithm in python with basic functions.

I found this example, but it doesn't give me what I'm looking for.

def zeros(X: int, Y: int):
    #        ^       ^  incorrect type annotations. should be str
    lenX = len(X) + 1
    lenY = len(Y) + 1
    matrix = []
    for i in range(lenX):
        matrix.append([0] * lenY)
    # A more "pythonic" way of expressing the above would be:
    # matrix = [[0] * len(Y) + 1 for _ in range(len(x) + 1)]
    
    def score(X, Y):
        #     ^  ^ shadowing variables from outer scope. this is not a bug per se but it's considered bad practice
        if X[n] == Y[m]: return 4
        #    ^       ^  variables not defined in scope
        if X[n] == '-' or Y[m] == '-': return -4
        #    ^              ^  variables not defined in scope
        else: return -2

    def SmithWaterman(X, Y, score): # this function is never called
        #                   ^ unnecessary function passed as parameter. function is defined in scope
        for n in range(1, len(X) + 1):
            for m in range(1, len(Y) + 1):
                align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
                #                 ^ invalid list lookup. should be: matrix[n-1][m-1]
                indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
                #                                          ^ out of bounds error when m == len(Y)
                indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
                #                                  ^ out of bounds error when n == len(X)
        matrix[n, m] = max(align, indelX, indelY, 0)
        # this should be nested in the inner for-loop. m, n, indelX, and indelY are not defined in scope here
    print(matrix)

zeros("ACGT", "ACGT")

On a book I found this algorithm, but I couldn't implement it correctly.

input: sequences s and t, with |s| =n, |t| = m, score function, penality InDel

match +1, mismatch -2, InDel -1

M = matrix of size n+1 * m+1
M[i,j] = 0
i=j=0

enter image description here

Any help please Thanks

  • 1
    what was the issue in the example implementation? It seems pretty simple – Mahrkeenerh Oct 15 '21 at 19:55
  • I am a beginner in python (: –  Oct 15 '21 at 20:08
  • You should at least define which gap-penalties and which score formula you will want to use. Those are free choices, but they should be made. That choice is a matter of opinion, so it cannot be the subject of the question. You'll have to tell us what their definition must be. – trincot Oct 15 '21 at 20:23
  • Thank you @trincot. This is the parameters: match est +1 et mismatch -2. The score function could be like these on the example –  Oct 15 '21 at 20:28
  • So the penalty is constant independent of the gap size? – trincot Oct 15 '21 at 20:31

2 Answers2

0

the image algorithm implementation:

M = []
for i in range(n):
    M.append([])
    for j in range(m):
        first = max(M[i - 1][j - 1] + score(s[i], t[j])
        second = M[i - 1][j] + penal
        third = M[i][j - 1] + penal

        M[i].append(first, second, third, 0))

But you will have to fix the edge cases (out of range) and add some default values.

Mahrkeenerh
  • 1,104
  • 1
  • 9
  • 25
0

The problems with the code you presented are well described in the comments in that piece of code.

Assuming that you want a linear gap-penalty of 2 points, and you are looking for the first phase algorithm only (so excluding the trace-back process), the code can be fixed as follows:

def score(x, y):
    return 4 if x == y else (
        -4 if '-' in (x, y) else -2
    )

def zeros(a, b):
    penalty = 2  # linear penalty (see Wikipedia)
    nextrow = [0] * (len(b) + 1)
    matrix = [nextrow]
    for valA in a:
        row, nextrow = nextrow, [0]
        for m, valB in enumerate(b):
            nextrow.append(max(
                row[m] + score(valA, valB),
                row[m+1] - penalty,
                nextrow[m] - penalty,
                0
            ))
        matrix.append(nextrow)
    return matrix

# Example run:
result = zeros("ACGT", "AC-GT")
print(result)
trincot
  • 317,000
  • 35
  • 244
  • 286