Find a highest-scoring overlap alignment between two strings.
Input: A match score m, a mismatch penalty μ, a gap penalty σ, and two DNA strings s and t.
Output: The maximum alignment score of an overlap alignment between s and t followed by an overlap alignment achieving this maximum score.
Biologists use overlapping reads to assemble a genome, a problem that is complicated by errors in reads. To find overlaps between error-prone reads, we define an overlap alignment of strings s = s1 ... sn and t = t1 ... tm as a global alignment of a suffix of s with a prefix of t. An optimal overlap alignment of strings s and t maximizes the global alignment score between an i-suffix of s and a j-prefix of t (i.e., between si ... sn and t1 ... tj) among all i and j.
Input Format. The first line of the input contains m followed by μ followed by σ (separated by spaces), the second line of the input contains a DNA string s, and the third line of the input contains a DNA string t.
Output Format. The first line of the output should contain the maximum score of an overlap alignment between s and t, and the next two lines should contain an overlap alignment between a suffix of s and a prefix of t achieving this maximum score. Specifically, the second line should contain a suffix of s with gaps placed appropriately, and the third line should contain a prefix of t with gaps placed appropriately.
The following code works for some cases and fails for others (test cases provided below):
def align(m, mu, sigma, s, t):
# initialize the scoring matrix
score = [[0 for _ in range(len(t) + 1)] for _ in range(len(s) + 1)]
backtrack = [[0 for _ in range(len(t) + 1)] for _ in range(len(s) + 1)]
for i in range(1, len(s) + 1):
backtrack[i][0] = "↓"
for j in range(1, len(t) + 1):
score[0][j] = score[0][j - 1] - sigma
backtrack[0][j] = "→"
# fill in the scoring matrix
for i in range(1, len(s) + 1):
for j in range(1, len(t) + 1):
if s[i - 1] == t[j - 1]:
match = score[i - 1][j - 1] + m
else:
match = score[i - 1][j - 1] - mu
delete = score[i - 1][j] - sigma
insert = score[i][j - 1] - sigma
score[i][j] = max(match, delete, insert)
if score[i][j] == match:
backtrack[i][j] = "↘"
elif score[i][j] == delete:
backtrack[i][j] = "↓"
else:
backtrack[i][j] = "→"
# reconstruct the alignment
max_score = max(score[-1])
i = len(s)
j = score[-1].index(max_score)
aligned_s, aligned_t = "", ""
while i > 0 and j > 0:
if backtrack[i][j] == "↘":
aligned_s = s[i - 1] + aligned_s
aligned_t = t[j - 1] + aligned_t
i -= 1
j -= 1
elif backtrack[i][j] == "↓":
aligned_s = s[i - 1] + aligned_s
aligned_t = "-" + aligned_t
i -= 1
else:
aligned_s = "-" + aligned_s
aligned_t = t[j - 1] + aligned_t
j -= 1
print(score[len(s)][len(t)])
print(aligned_s)
print(aligned_t)
if __name__ == "__main__":
m, mu, sigma = map(int, input().strip().split())
s, t = [input().strip() for _ in range(2)]
align(m, mu, sigma, s, t)
Passing test cases:
Input:
1 1 2
GAGA
GAT
Output:
2
GA
GA
Input:
1 1 1
CCAT
AT
Output:
2
AT
AT
Input:
1 5 1
GAT
CAT
Output:
1
-AT
CAT
Input:
1 5 1
ATCACT
AT
Output:
1
ACT
A-T
Failing test cases:
Input:
1 1 5
ATCACT
ATG
Output:
0
CT
AT
Actual output:
0
"" # empty string
"" # empty string
Input:
3 2 1
CAGAGATGGCCG
ACG
Output:
5
-CG
ACG
Actual output:
5
ATGGCCG
A----CG
Input:
2 3 1
CTT
AGCATAAAGCATT
Correct output:
0
--CT-T
AGC-AT
Actual output:
0
"" # empty string
"" # empty string