
Suppose we have two strings "abcdefgh" and "abudesh". I want the solution to be the list ["ab", "de", "h"]. That is, I want a list of the maximal contiguous substrings that are common to both strings. Does this problem have a name, and what would be a good approach to solving it?

Edit: I should add that the order in which the substrings occur is not important. For example, for the two strings "abcdefg" and "defkabc", the result should be ["abc", "def"].

Alem
  • What order of complexity do you want? O(n^2)? – Ashish sah Jan 02 '22 at 05:34
  • @Ashishsah Well, I don't have many strings (only about 6000), so any solution would work in that case, probably. – Alem Jan 02 '22 at 05:38
  • [This](https://github.com/kasravnd/SuffixTree) python package claims to be able to solve this in linear time using suffix trees. I have not tried it myself and there may be more popular suffix tree packages around. – hilberts_drinking_problem Jan 02 '22 at 06:09
  • @Alem Working on it; let's see if I can reach an optimal solution using loops. – Ashish sah Jan 02 '22 at 06:11
  • Have some fun with [Biopython](https://biopython.org/docs/1.76/api/Bio.pairwise2.html). `print( pairwise2.align.globalxx('abcdefgh', 'abudesh') )` prints `[Alignment(seqA='abc-defg-h', seqB='ab-ude--sh', score=5.0, start=0, end=10), Alignment(seqA='abcdefg-h', seqB='abude--sh', score=5.0, start=0, end=9), Alignment(seqA='abc-defgh', seqB='ab-ude-sh', score=5.0, start=0, end=9), Alignment(seqA='abcdefgh', seqB='abude-sh', score=5.0, start=0, end=8), Alignment(seqA='abc-defgh', seqB='ab-udes-h', score=5.0, start=0, end=9), Alignment(seqA='abcdefgh', seqB='abudes-h', score=5.0, start=0, end=8)]` – Stef Jan 02 '22 at 10:21

1 Answer


Using Biopython's `pairwise2` and `itertools.groupby`:

from Bio import pairwise2
from itertools import groupby

def maxConnectedSubstrings(strA, strB):
    alignment = pairwise2.align.globalxx(strA, strB)[0]
    grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
    return [''.join(ca for ca,cb in g) for k,g in grouped if k]

print( maxConnectedSubstrings('abcdefgh', 'abudesh') )
# ['ab', 'de', 'h']

Explanation

First, we align the sequences. The result of `alignment = pairwise2.align.globalxx(strA, strB)[0]` is:

alignment.seqA = 'abcdefgh'
alignment.seqB = 'abude-sh'

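The alignment algorithm found a best way to insert the gap character '-' into the sequences so that as many positions as possible hold the same character in both. Several alignments can tie for the best score; `[0]` simply takes the first one returned.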
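For readers who want to see the idea without Biopython, here is a minimal sketch of such an alignment. It is not Biopython's implementation; it assumes the scoring used by `globalxx` (a match counts 1, mismatches and gaps cost nothing), which makes the problem equivalent to finding a longest common subsequence. The name `align_lcs` is made up for this example, and the sketch assumes the gap character does not occur in the inputs.

```python
def align_lcs(a, b, gap="-"):
    """Insert gap characters into a and b so that the number of columns
    holding the same character is maximal (match = 1, no penalties)."""
    n, m = len(a), len(b)
    # score[i][j] = length of a longest common subsequence of a[i:] and b[j:]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                score[i][j] = score[i + 1][j + 1] + 1
            else:
                score[i][j] = max(score[i + 1][j], score[i][j + 1])

    # Walk the table from the top-left corner to rebuild one optimal alignment.
    out_a, out_b = [], []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            # Matching characters share a column.
            out_a.append(a[i])
            out_b.append(b[j])
            i += 1
            j += 1
        elif score[i + 1][j] >= score[i][j + 1]:
            # Skipping a[i] loses nothing: pair it with a gap in b.
            out_a.append(a[i])
            out_b.append(gap)
            i += 1
        else:
            # Otherwise skip b[j]: pair it with a gap in a.
            out_a.append(gap)
            out_b.append(b[j])
            j += 1

    # Whatever is left in one string is paired with gaps in the other.
    out_a.extend(a[i:])
    out_b.extend(gap * (n - i))
    out_a.extend(gap * (m - j))
    out_b.extend(b[j:])
    return "".join(out_a), "".join(out_b)


seq_a, seq_b = align_lcs("abcdefgh", "abudesh")
print(seq_a)
print(seq_b)
```

The table takes O(len(a) * len(b)) time and memory, and the loops translate directly to C or Rust. When several alignments tie, this sketch returns just one of them, which need not be the one a library picks.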

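Then, we use `groupby` on `zip(alignment.seqA, alignment.seqB)`. The `zip(...)` is a sequence of pairs (character from seqA, character from seqB). We group consecutive pairs with the key `lambda p: p[0] == p[1]`, which is True when the two characters match. This gives the following result (written out as a list for readability; `groupby` actually returns a lazy iterator):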

grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])

grouped = [
    (True,  [('a', 'a'),
             ('b', 'b')]),
    (False, [('c', 'u')]),
    (True,  [('d', 'd'),
             ('e', 'e')]),
    (False, [('f', '-'),
             ('g', 's')]),
    (True,  [('h', 'h')])
]

Finally, we discard the False groups, and we join the letters of every True group.
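The grouping step does not need `itertools`. Below is a minimal sketch of the same logic written as a plain loop, which should be easy to port to C, Rust, or another language. The function name `matching_runs` is made up for this example, and the input is assumed to be two already-aligned strings of equal length, such as the `seqA` and `seqB` shown above.

```python
def matching_runs(seq_a, seq_b):
    """Collect the maximal runs of columns where two aligned strings agree."""
    runs = []
    current = []  # characters of the run being built
    for ca, cb in zip(seq_a, seq_b):
        if ca == cb:
            # Same character in both strings: extend the current run.
            current.append(ca)
        elif current:
            # A mismatch or gap ends the run, so store it and start over.
            runs.append("".join(current))
            current = []
    if current:
        # Do not forget a run that reaches the end of the strings.
        runs.append("".join(current))
    return runs


print(matching_runs("abcdefgh", "abude-sh"))
# ['ab', 'de', 'h']
```

This is a single pass over the aligned strings, so it runs in time linear in their length.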

Stef
  • I really appreciate your answer and thank you very much, but I cannot accept it (I only gave it a thumbs up) because this question is algorithmic in nature. Maybe some day I will want to implement the same thing in C, or Rust, or any other programming language; how would I do it? What would the pseudocode be? That is what I really need. I like Python and I use Python, but this is a purely algorithmic question. I would like to understand the approach well enough to solve it by hand... – Alem Jan 02 '22 at 11:23
  • @Alem The "groupby" function is extremely simple and you could code your own easily in any programming language. The alignment algorithms are much more complex: see [What is the algorithm behind pairwise2 align in BioPython?](https://stackoverflow.com/questions/69042028/what-is-the-algorithm-behind-pairwise2-align-in-biopython) – Stef Jan 02 '22 at 11:34
  • I just checked your solution on two strings in another language and it failed. With string_1 = "يستفتونك قل الله يفتيكم فى الكلله " and string_2 = " يورث كلله او امراه وله اخ او اخت فلكل وحد منهما السدس فان كان" I got ['ي', 'و', ' ', 'ل', ' ا', 'له ', ' ', ' ا', 'لك', 'ل', 'ه', ' '], and the answer should not look like that: كلله is not present in the result. I mean, it works great for ordinary Latin-script strings, but for Arabic it fails for some reason. – Alem Jan 02 '22 at 11:45
  • I just realized that the problem is not the language but the order. For example, if you have the two strings "abcdefgh" and "defkabc", the result needs to be ["abc", "def"]. So the order is not important. I just edited the question. – Alem Jan 02 '22 at 11:56
  • @Alem Well, that's a completely different problem if order doesn't matter. See these questions: [Finding all the common substrings of given two strings](https://stackoverflow.com/questions/34805488/finding-all-the-common-substrings-of-given-two-strings); [Function to find all common substrings in two strings](https://stackoverflow.com/questions/45702566/function-to-find-all-common-substrings-in-two-strings-not-giving-correct-output) – Stef Jan 02 '22 at 12:22
  • Great. Thank you very much for this. I accepted your answer. – Alem Jan 02 '22 at 12:27
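
For the order-independent variant raised in the comments, one possible reading is "every common substring that cannot be extended to the left or right while still matching". The following is a minimal sketch under that assumption, using a common-suffix table. It is not taken from the linked questions, and the name `maximal_common_substrings` is made up for this example.

```python
def maximal_common_substrings(a, b):
    """Return the common substrings of a and b that cannot be extended
    on either side, in order of where they end in a."""
    n, m = len(a), len(b)
    # run[i][j] = length of the longest common suffix of a[:i] and b[:j]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1

    found = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = run[i][j]
            # The match can grow to the right if the next characters agree.
            extends_right = i < n and j < m and a[i] == b[j]
            if length > 0 and not extends_right:
                # run[i][j] already covers the longest extension to the left.
                piece = a[i - length:i]
                if piece not in found:
                    found.append(piece)
    return found


print(maximal_common_substrings("abcdefg", "defkabc"))
# ['abc', 'def']
```

The table costs O(len(a) * len(b)) time and memory. When characters repeat, the returned substrings may overlap each other, so a further selection step might be needed depending on what result is wanted.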