-1

I'm somewhat familiar with Biopython's pairwise2 function but I noticed that it adds dashes within the sequence in order to obtain the best possible alignment score. For example,

for a in pairwise2.align.globalxx("ACCGT", "ACG"):
  print(format_alignment(*a))

would yield this result:

ACCGT
|||||
A-CG-
Score=3
<BLANKLINE>
ACCGT
|||||
AC-G-
Score=3
<BLANKLINE>

Even though the first 2 characters (A & C) in the 2nd sequence would align with the 1st sequence. Is there a way to find the number of aligned base pairs and not the highest number of aligned base pairs (e.g.: a sequence of ACTGAA would have a score of 3 against a sequence of GCCGTA)?

superasiantomtom95
  • 521
  • 1
  • 7
  • 25

2 Answers2

0

If you are just trying to prevent the function from adding gaps you can increase the gap penalty. The alignment takes parameters to set the match score, nonmatch penalty, create gap penalty, and extend gap penalty:

pairwise2.align.globalxx("ACCGT", "ACG", 2, -1, -1, -0.5)
Tom Sitter
  • 1,082
  • 1
  • 10
  • 23
  • 1
    This should be ```...globalms(...```. ```globalxx``` does not allow other parameters except the two sequences – Markus Nov 08 '17 at 21:00
0

So you just want to count the identical bases in two sequences (of same length) without doing any alignment?

Like so:

seq1 = 'ACTGAA'
seq2 = 'GCCGTA'

score = 0

for a, b in zip(seq1, seq2):
    if a == b:
        score +=1

print(score)

In a more pythonic way:

seq1 = 'ACTGAA'
seq2 = 'GCCGTA'

score = sum([1 for a, b in zip(seq1, seq2) if a == b])
print(score)

Note that the reverse of this score (number of non-identical bases) would be the Hamming distance. While you can force Biopythons pairwise2 to return your desired result by forcing very high gap penalties, the above shown solution seems simpler.

# I don't recommend this
pairwise2.align.globalxs(seq1, seq2, -1000, -1000)
Markus
  • 385
  • 2
  • 10