Using:
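from Bio import pairwise2
from itertools import groupby

def maxConnectedSubstrings(strA, strB):
    # Take the first of the best-scoring global alignments.
    alignment = pairwise2.align.globalxx(strA, strB)[0]
    # Group aligned pairs into runs of matches (True) and non-matches (False).
    grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
    # Keep only the matching runs and join their characters.
    return [''.join(ca for ca, cb in g) for k, g in grouped if k]

print(maxConnectedSubstrings('abcdefgh', 'abudesh'))
# ['ab', 'de', 'h']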
Explanation
First, we align the sequences. The result of alignment = pairwise2.align.globalxx(strA, strB)[0] is:
alignment.seqA = 'abcdefgh'
alignment.seqB = 'abude-sh'
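The alignment algorithm found the best way to insert gap characters ('-') into the sequences so that they line up. globalxx scores 1 for each pair of identical characters and 0 for everything else, with no gap penalties, so the best alignment is one with the most matching positions.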
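The properties of a gapped alignment can be checked directly on the two strings shown above. This is a minimal sketch that hard-codes those strings instead of calling Biopython, so it runs without the library; the variable names are illustrative only.

```python
# Aligned strings copied from the example above (hard-coded, not computed here).
seqA = 'abcdefgh'
seqB = 'abude-sh'

# Both aligned strings have the same length, so they can be compared column by column.
same_length = len(seqA) == len(seqB)

# Removing the gap character recovers the original inputs.
originalA = seqA.replace('-', '')
originalB = seqB.replace('-', '')

# Indices of the columns where both strings hold the same character.
matching_columns = [i for i, (ca, cb) in enumerate(zip(seqA, seqB)) if ca == cb]

print(same_length, originalA, originalB, matching_columns)
```

The matching columns come in consecutive runs, and those runs are what the next step extracts.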
Then, we use groupby on zip(alignment.seqA, alignment.seqB). The zip(...) is a sequence of pairs (character from seqA, character from seqB). We group these pairs with the key lambda p: p[0] == p[1], which gives the following result:
grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
# Conceptually (groupby itself returns a lazy iterator, not a list):
grouped = [
    (True,  [('a', 'a'), ('b', 'b')]),
    (False, [('c', 'u')]),
    (True,  [('d', 'd'), ('e', 'e')]),
    (False, [('f', '-'), ('g', 's')]),
    (True,  [('h', 'h')]),
]
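Since groupby yields each group as a one-shot iterator, the structure above only becomes visible once the groups are turned into lists. A small sketch, hard-coding the aligned strings from the example rather than calling Biopython (the name materialized is illustrative):

```python
from itertools import groupby

# Aligned strings copied from the example (hard-coded, not computed here).
seqA = 'abcdefgh'
seqB = 'abude-sh'

# Turn every lazy group into a list so the result can be printed and inspected.
materialized = [
    (is_match, list(group))
    for is_match, group in groupby(zip(seqA, seqB), key=lambda p: p[0] == p[1])
]

for is_match, group in materialized:
    print(is_match, group)
```

Note that groupby only merges consecutive items with the same key, which is why the True runs stay separate instead of collapsing into one group.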
Finally, we discard the False groups, and we join the letters of every True group.
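The filter-and-join step can be tried in isolation. The helper below, matching_runs, is a hypothetical name introduced for illustration; it assumes its two arguments are already-aligned strings of equal length, so no Biopython call is needed.

```python
from itertools import groupby

def matching_runs(alignedA, alignedB):
    """Return the maximal runs of columns where two aligned strings agree."""
    grouped = groupby(zip(alignedA, alignedB), key=lambda p: p[0] == p[1])
    # Keep only the groups whose key is True and join the characters from alignedA.
    return [''.join(ca for ca, _ in group) for is_match, group in grouped if is_match]

print(matching_runs('abcdefgh', 'abude-sh'))  # ['ab', 'de', 'h']
```

A gap column never counts as a match, because '-' is only ever paired with a real character from the other sequence.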