I need to find the overlap between two sequences. The overlap is at the right end of the first sequence and the left end of the second one. I want a function that finds the overlap and returns it.

My sequences are:

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

My function should be named

def getOverlap(left, right)

with s1 as the left sequence and s2 as the right one.

The outcome should be

'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'

Any help is appreciated.

Nico Schlömer
anne

4 Answers

Have a look at the difflib library and more precisely at find_longest_match():

import difflib

def get_overlap(s1, s2):
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
    return s1[pos_a:pos_a+size]

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

print(get_overlap(s1, s2)) # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC
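
Note that find_longest_match() returns the longest block the two strings share anywhere, which for these inputs happens to coincide with the end overlap. A minimal sketch that only compares the end of left with the start of right, assuming exact matching (no mismatches or gaps) and using the function name requested in the question, could look like this:

```python
def getOverlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    # Try the longest candidate first so the first hit is the answer.
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""  # no overlap at all


s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

print(getOverlap(s1, s2))  # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC
print(getOverlap(s2, s1))  # C
```

This version is quadratic in the worst case, which is fine for short reads; unlike the longest-match approach, swapping the arguments changes the result.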
Nicolas

You could use difflib.SequenceMatcher:

>>> import difflib
>>> d = difflib.SequenceMatcher(None, s1, s2)
>>> match = max(d.get_matching_blocks(),key=lambda x:x[2])
>>> match
Match(a=8, b=0, size=39)
>>> i,j,k = match
>>> d.a[i:i+k]
'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'
>>> d.a[i:i+k] == d.b[j:j+k]
True
>>> d.a == s1
True
>>> d.b == s2
True
mgilson

The Knuth-Morris-Pratt algorithm is a nice method for finding one string inside another (since I saw DNA, I'm guessing you'd like this to scale to... billions?).

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

from __future__ import generators

def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

The link where I got the KMP python code (and a builtin, which will be faster for small problems because of the runtime constant).
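
The generator above finds whole copies of a pattern inside a text. For the end overlap asked about in the question, the same failure-table idea can be applied directly. The following is only a sketch under stated assumptions: prefix_function and get_overlap_kmp are hypothetical names, and the None sentinel relies on the inputs being ordinary strings, so it never equals a real character.

```python
def prefix_function(seq):
    """KMP failure table: pi[i] is the length of the longest proper
    prefix of seq[:i+1] that is also a suffix of it."""
    pi = [0] * len(seq)
    k = 0
    for i in range(1, len(seq)):
        while k > 0 and seq[i] != seq[k]:
            k = pi[k - 1]
        if seq[i] == seq[k]:
            k += 1
        pi[i] = k
    return pi


def get_overlap_kmp(left, right):
    """Longest suffix of `left` that is also a prefix of `right`, in linear time."""
    if not left or not right:
        return ""
    # right + sentinel + left: the last table entry is the length of the
    # longest prefix of `right` that ends exactly at the end of `left`.
    combined = list(right) + [None] + list(left)
    return right[:prefix_function(combined)[-1]]


s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"
print(get_overlap_kmp(s1, s2))  # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC
```

The sentinel keeps a match from running across the boundary between the two strings, so the result can never be longer than either input.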

For bleeding-edge performance, use a prefix table and hash windows of your string as base 4 integers (in biology you'd call them k-mers or oligos). ; )
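
To make the base-4 idea concrete, here is a rough sketch of encoding every length-k window as an integer with a rolling update. The A/C/G/T to 0..3 mapping and the name kmer_codes are assumptions for illustration, and the sketch does not handle ambiguous bases such as N.

```python
# Assumed mapping of nucleotides to base-4 digits.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_codes(seq, k):
    """Yield (start, value) for every length-k window of `seq`,
    where value is the window read as a base-4 integer."""
    if k <= 0 or len(seq) < k:
        return
    modulus = 4 ** k
    value = 0
    for i, base in enumerate(seq):
        # Shift in the new digit and drop the digit that left the window.
        value = (value * 4 + CODE[base]) % modulus
        if i >= k - 1:
            yield i - k + 1, value


print(list(kmer_codes("ACGT", 2)))  # [(0, 1), (1, 6), (2, 11)]
```

Equal integers mean equal k-mers, so windows from two sequences can be compared or put in a dictionary without slicing strings.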

Good luck!

EDIT: There's also a nice trick where you sort a list containing every suffix (n total) of the first string and every suffix (n total) of the second string. The longest common substring then shows up as the longest shared prefix between two adjacent entries that come from different strings, so for each entry look at its neighbour from the other string and keep the longest prefix that matches completely. : )
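
The sorting trick in the EDIT above can be sketched as follows. This is a naive illustration, not a tuned implementation: it sorts the suffixes of both strings, materialises every suffix (quadratic memory), and uses the hypothetical name longest_common_substring_sorted. A real suffix array would be needed for long sequences.

```python
def longest_common_substring_sorted(s1, s2):
    """Longest common substring via sorting the suffixes of both strings."""
    # Tag each suffix with the string it came from (0 for s1, 1 for s2).
    suffixes = sorted(
        [(s1[i:], 0) for i in range(len(s1))]
        + [(s2[i:], 1) for i in range(len(s2))]
    )
    best = ""
    # Only neighbours from different strings can witness a common substring.
    for (a, src_a), (b, src_b) in zip(suffixes, suffixes[1:]):
        if src_a == src_b:
            continue
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best


s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"
print(longest_common_substring_sorted(s1, s2))
# GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC
```

Like the difflib answers, this finds the longest common substring anywhere in the two strings, not specifically an overlap at the ends.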

user

Longest Common Substring

def LongestCommonSubstring(S1, S2):
    # M[x][y] = length of the common substring ending at S1[x-1] and S2[y-1]
    M = [[0] * (1 + len(S2)) for _ in range(1 + len(S1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(S1)):
        for y in range(1, 1 + len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y] > longest:
                    longest = M[x][y]
                    x_longest = x
            else:
                M[x][y] = 0
    return S1[x_longest-longest:x_longest]
munk
  • This is the best I can come up with: – anne Jan 02 '13 at 20:42
  • left = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC" right = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC" def getOverlap(left,right): if left == right[::-1]: return "" else: for i in left: for i in right: if left[len(left)-i]==(right[::-1])[len(right)-i]: if False: continue return right[:len(left)-i] – anne Jan 02 '13 at 20:42
  • @anne You should put that comment in your post, so it's formatted and more readable. – munk Jan 02 '13 at 20:43
  • Thanks. If I swap s1 and s2, I should get the answer 'c', but I get the same result as when s1 is the left one and s2 the right one. – anne Jan 02 '13 at 21:03
  • The question is not about common substrings. It simply looks at the suffix of the first string and the prefix of the second string. – poke Jan 02 '13 at 23:07