Overlap Alignment (overlaps between error-prone reads)

Question

Find a highest-scoring overlap alignment between two strings.

Input: A match score m, a mismatch penalty μ, a gap penalty σ, and two DNA strings s and t.

Output: The maximum alignment score of an overlap alignment between s and t followed by an overlap alignment achieving this maximum score.

Biologists use overlapping reads to assemble a genome, a problem that is complicated by errors in reads. To find overlaps between error-prone reads, we define an overlap alignment of strings s = s1 ... sn and t = t1 ... tm as a global alignment of a suffix of s with a prefix of t. An optimal overlap alignment of strings s and t maximizes the global alignment score between an i-suffix of s and a j-prefix of t (i.e., between si ... sn and t1 ... tj) among all i and j.

Input Format. The first line of the input contains m followed by μ followed by σ (separated by spaces), the second line of the input contains a DNA string s, and the third line of the input contains a DNA string t.

Output Format. The first line of the output should contain the maximum score of an overlap alignment between s and t, and the next two lines should contain an overlap alignment between a suffix of s and a prefix of t achieving this maximum score. Specifically, the second line should contain a suffix of s with gaps placed appropriately, and the third line should contain a prefix of t with gaps placed appropriately.

The following code works for some cases and fails for others (test cases provided below):

def align(m, mu, sigma, s, t):
    # initialize the scoring matrix
    score = [[0 for _ in range(len(t) + 1)] for _ in range(len(s) + 1)]
    backtrack = [[0 for _ in range(len(t) + 1)] for _ in range(len(s) + 1)]

    for i in range(1, len(s) + 1):
        backtrack[i][0] = "↓"
    for j in range(1, len(t) + 1):
        score[0][j] = score[0][j - 1] - sigma
        backtrack[0][j] = "→"

    # fill in the scoring matrix
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                match = score[i - 1][j - 1] + m
            else:
                match = score[i - 1][j - 1] - mu
            delete = score[i - 1][j] - sigma
            insert = score[i][j - 1] - sigma
            score[i][j] = max(match, delete, insert)
            if score[i][j] == match:
                backtrack[i][j] = "↘"
            elif score[i][j] == delete:
                backtrack[i][j] = "↓"
            else:
                backtrack[i][j] = "→"

    # reconstruct the alignment
    max_score = max(score[-1])
    i = len(s)
    j = score[-1].index(max_score)
    aligned_s, aligned_t = "", ""
    while i > 0 and j > 0:
        if backtrack[i][j] == "↘":
            aligned_s = s[i - 1] + aligned_s
            aligned_t = t[j - 1] + aligned_t
            i -= 1
            j -= 1
        elif backtrack[i][j] == "↓":
            aligned_s = s[i - 1] + aligned_s
            aligned_t = "-" + aligned_t
            i -= 1
        else:
            aligned_s = "-" + aligned_s
            aligned_t = t[j - 1] + aligned_t
            j -= 1
    print(score[len(s)][len(t)])
    print(aligned_s)
    print(aligned_t)


if __name__ == "__main__":
    m, mu, sigma = map(int, input().strip().split())
    s, t = [input().strip() for _ in range(2)]
    align(m, mu, sigma, s, t)

Passing test cases:

Input:
1 1 2
GAGA
GAT

Output:
2
GA
GA

Input:
1 1 1
CCAT
AT

Output:
2
AT
AT

Input:
1 5 1
GAT
CAT

Output:
1
-AT
CAT

Input:
1 5 1
ATCACT
AT

Output:
1
ACT
A-T

Failing test cases:

Input:
1 1 5
ATCACT
ATG

Output:
0
CT
AT

Actual output:
0
"" # empty string
"" # empty string

Input:
3 2 1
CAGAGATGGCCG
ACG

Output:
5
-CG
ACG

Actual output:
5
ATGGCCG
A----CG

Input:
2 3 1
CTT
AGCATAAAGCATT

Correct output:
0
--CT-T
AGC-AT

Actual output:
0
"" # empty string
"" # empty string

I'm having troubles understanding what the actual question is here. Are you asking how to deal with the scenario that no data is presented as input? — Marcelo Paco, Apr 01 '23 at 18:06
@MarceloPaco I will post a solution once it's done; I'm having trouble with it as well because the test cases are not provided. — Ali Alsawad, Apr 03 '23 at 06:37

Overlap Alignment (overlaps between error-prone reads)

0 Answers0