Find contacting and non-contacting part of two strings

Question

I have two examples of pairs of strings:

YHFLSPYVY      # answer
   LSPYVYSPR   # prediction
+++******ooo


  YHFLSPYVS    # answer
VEYHFLSPY      # prediction
oo*******++

As shown above, I'd like to find the overlapping region (*) and the non-overlapping regions in the answer (+) and in the prediction (o).

How can I do it in Python?

I'm stuck with this attempt, which returns an empty list:

import re
# This is example 1
ans = "YHFLSPYVY"
pred = "LSPYVYSPR"
# Lookahead search for the whole of pred inside ans
matches = re.finditer(r'(?=(%s))' % re.escape(pred), ans)
print([m.start(1) for m in matches])
# []

The answer I hope to get for example 1 is:

plus_len = 3
star_len = 6
ooo_len = 3

Do you also want the string with *+o or just the values of plus_len etc.? — Anupam Mohanty, Jul 27 '16 at 12:37
Looks like [longest common subsequence](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem) — Moon Cheesez, Jul 27 '16 at 12:37
Will a single character common to both also be considered an overlap? — Moses Koledoye, Jul 27 '16 at 12:42

vaultah · Accepted Answer · 2016-07-27T13:08:50.503

score 3

It's easy with difflib.SequenceMatcher.find_longest_match:

from difflib import SequenceMatcher

def f(answer, prediction):
    sm = SequenceMatcher(a=answer, b=prediction)
    match = sm.find_longest_match(0, len(answer), 0, len(prediction))
    star_len = match.size
    return (len(answer) - star_len, star_len, len(prediction) - star_len)

The function returns a 3-tuple of integers (plus_len, star_len, ooo_len):

f('YHFLSPYVY', 'LSPYVYSPR') -> (3, 6, 3)
f('YHFLSPYVS', 'VEYHFLSPY') -> (2, 7, 2)
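
If you also want the marker string itself (as asked in the comments), here is a minimal sketch built on the same find_longest_match call. The function name overlap_markers is hypothetical, and it assumes the overlap is the tail of one string and the head of the other, as in both examples; it does not handle mismatches inside the aligned region. Passing autojunk=False is a precaution so that difflib's frequency heuristic never kicks in on long inputs.

```python
from difflib import SequenceMatcher

def overlap_markers(answer, prediction):
    """Build a marker string: '+' answer only, '*' overlap, 'o' prediction only.

    Assumes the overlap is the end of one string and the start of the other.
    """
    sm = SequenceMatcher(a=answer, b=prediction, autojunk=False)
    match = sm.find_longest_match(0, len(answer), 0, len(prediction))
    plus = '+' * (len(answer) - match.size)
    stars = '*' * match.size
    oos = 'o' * (len(prediction) - match.size)
    # match.a / match.b are the start offsets of the overlap in each string;
    # the string whose overlap starts later is the one that comes first.
    if match.a >= match.b:
        return plus + stars + oos
    return oos + stars + plus

print(overlap_markers('YHFLSPYVY', 'LSPYVYSPR'))  # +++******ooo
print(overlap_markers('YHFLSPYVS', 'VEYHFLSPY'))  # oo*******++
```
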

edited Jul 27 '16 at 13:08

answered Jul 27 '16 at 12:48


SO is like an army of super intelligent minds, not even a second and the question was answered :D ! – e-nouri Jul 27 '16 at 12:54

score 1 · Answer 2 · answered Jul 27 '16 at 12:52

You can use difflib:

import difflib

ans = "YHFLSPYVY"
pred = "LSPYVYSPR"

def get_overlap(s1, s2):
    # Longest contiguous block shared by s1 and s2
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2))
    return s1[pos_a:pos_a + size]

overlap = get_overlap(ans, pred)
# Strip the overlap from each string to get the non-overlapping parts
plus = ans.replace(overlap, "")
oo = pred.replace(overlap, "")

print(len(overlap))  # 6
print(len(plus))     # 3
print(len(oo))       # 3
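
One caveat: str.replace removes every occurrence of the overlap, so the counts can come out wrong if the overlapping text repeats inside one of the strings. A possible variant, sketched here with a hypothetical helper name split_around_overlap, slices around the matched positions instead, so only the matched block is cut out:

```python
import difflib

def split_around_overlap(s1, s2):
    """Return (overlap, rest_of_s1, rest_of_s2) by slicing at the match."""
    s = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2))
    overlap = s1[pos_a:pos_a + size]
    # Keep everything before and after the matched block in each string
    rest1 = s1[:pos_a] + s1[pos_a + size:]
    rest2 = s2[:pos_b] + s2[pos_b + size:]
    return overlap, rest1, rest2

print(split_around_overlap("YHFLSPYVY", "LSPYVYSPR"))  # ('LSPYVY', 'YHF', 'SPR')
print(split_around_overlap("ABAB", "AB"))              # ('AB', 'AB', '')
```

With "ABAB" and "AB", the replace-based version would leave an empty string for the first input, while slicing keeps the second "AB".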
