Algorithm to check the order of substrings in combined string

Question

The problem runs as follows: given two strings str1 and str2 and a third string str3, write a function which checks whether str3 contains the letters of both str1 and str2 in the same order as in the original strings, though they may be interleaved. For example, adbfec returns true for the substrings adf and bec. I have written the following function in Python:

def isinter(str1,str2,str3):
    p1,p2,p3 = 0,0,0
    while p3 < len(str3):
        if p1 < len(str1) and str3[p3] == str1[p1]:
            p1 += 1
        elif p2 < len(str2) and str3[p3] == str2[p2]:
            p2 += 1
        else:
            break
        p3 = p1+p2
    return p3 == len(str3)

There is another version of this program, at ardentart (the last solution). Now which one is better? I think mine, for it probably does it in linear time. Whether it is better or not, is there any further room for optimization in my algo?

Yeah, I can pre-compute `len(str1)` and `len(str2)`. Anything else? — SexyBeast, Sep 23 '12 at 22:25
The `p3 = p1+p2` line. But after thinking it over, both versions accomplish the same thing. — LSerni, Sep 23 '12 at 22:43
Your description doesn't make it clear that it has to be _only_ letters picked from `str1` and `str2`, not just _containing_ both runs. That is "xyzacbuvw" wouldn't be valid for strings "ac" and "ab". — Mu Mind, Sep 23 '12 at 22:59
@hayden No, this is just measuring runtime growth; you don't need to worry about operations with constant (O(1)) runtimes. — jamylak, Sep 24 '12 at 09:08
@jamylak Ah! [this](http://stackoverflow.com/questions/1115313/cost-of-len-function) I did not know, thanks. But surely you'll still get a small speed up by caching the result? — Andy Hayden, Sep 24 '12 at 09:34

score 2 · Answer 1 · answered Sep 23 '12 at 22:40

Unfortunately, your version does not work. Consider the input ab, ac, acab. Your algorithm returns False, which is incorrect: acab is simply ac followed by ab.

The problem is that you always advance along str1 when the letter seen in str3 could belong to either string. It may be necessary to advance along str2 instead, but your algorithm never gives it that chance.
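To make the counterexample above concrete, here is a small sketch that copies the logic of the function from the question (renamed `isinter_greedy` here purely for illustration) and runs it on both inputs mentioned in this thread.

```python
def isinter_greedy(str1, str2, str3):
    # Same logic as the function in the question; only the name is changed.
    p1, p2, p3 = 0, 0, 0
    while p3 < len(str3):
        if p1 < len(str1) and str3[p3] == str1[p1]:
            p1 += 1  # an ambiguous letter is always given to str1
        elif p2 < len(str2) and str3[p3] == str2[p2]:
            p2 += 1
        else:
            break
        p3 = p1 + p2
    return p3 == len(str3)


print(isinter_greedy("adf", "bec", "adbfec"))  # True
print(isinter_greedy("ab", "ac", "acab"))      # False, although "acab" is "ac" + "ab"
```

In the second call the leading a is consumed from str1, after which c matches neither str1[1] nor str2[0], so the loop stops early.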

answered Sep 23 '12 at 22:40

Jirka Hanika

And the beauty is that recursion (see [ardentart](http://www.ardendertat.com/2011/10/10/programming-interview-questions-6-combine-two-strings/)) does this complicated (backtracking) walk for you... – Andy Hayden Sep 23 '12 at 22:51

score 1 · Answer 2 · answered Sep 23 '12 at 22:24

Another way to approach it would be to use Python's regular-expression module, re. You could split str1 into its characters and surround each character with .* to match any number of characters (including none) in between them. This gives you the pattern to match str1 by. Then do the same for str2, and simply run re.match(str1pattern, str3) and re.match(str2pattern, str3). If both return objects (i.e. anything but None), then you have a match against both strings.

This will probably scale better, as it's easier to add more strings to check, and if you need better performance when searching many other strings, you can compile the patterns too.
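A minimal sketch of the pattern construction described above might look like this. The helper names `subsequence_pattern` and `contains_both` are made up for illustration, and the use of `re.escape` and `re.DOTALL` is an added assumption so that special characters and newlines in the input do not break the pattern.

```python
import re


def subsequence_pattern(s):
    # Hypothetical helper: "adf" -> ".*a.*d.*f.*", with regex metacharacters escaped.
    return ".*" + ".*".join(re.escape(ch) for ch in s) + ".*"


def contains_both(str1, str2, str3):
    # Each string is matched against str3 independently of the other.
    m1 = re.match(subsequence_pattern(str1), str3, re.DOTALL)
    m2 = re.match(subsequence_pattern(str2), str3, re.DOTALL)
    return m1 is not None and m2 is not None


print(contains_both("adf", "bec", "adbfec"))   # True
print(contains_both("ac", "ab", "xyzacbuvw"))  # True
```

Note that the second call also returns True, even though that input has extra letters and a shared a. Because the two patterns are matched independently, this approach only tests that each string appears in order within str3, not that str3 is built from exactly those letters.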

answered Sep 23 '12 at 22:24

CraigDouglas

Perhaps, but `regex` matching is never a standard method of optimizing string algorithms. Plus it itself is expensive, both in terms of time and space. – SexyBeast Sep 23 '12 at 22:28
As I understood the original question, the OP wants to know if the two strings together form the third - I'm afraid that cases such as common characters or extra characters would interfere with your solution. – LSerni Sep 23 '12 at 22:29

LSerni · Accepted Answer · 2012-09-24T19:53:38.840

You could split all three strings into lists:

list1 = list(str1)

and then walk list3 with the same algorithm you use now, checking whether list3[i] is equal to list1[0] or list2[0]. If it is, you'd del the item from the appropriate list.

Premature list end could then be caught as an exception.

The algorithm would be exactly the same, but implementation ought to be more performant.

UPDATE: turns out it actually isn't (about double the time). Oh well, might be useful to know.

And while benchmarking different scenarios, it turned out that unless it is specified that the three string lengths are "exact" (i.e., len(str1)+len(str2) == len(str3)), the most effective optimization is to check this first thing. This immediately discards all cases where the two input strings can't match the third because of bad string lengths.

Then I encountered some cases where the same letter is in both strings, and assigning it to list1 or list2 might lead to one of the strings no longer matching. In those cases the algorithm fails with a false negative, and fixing that requires recursion:

def isinter(str1,str2,str3,check=True):
    # print "Checking %s %s and %s" % (str1, str2, str3)
    p1,p2,p3 = 0,0,0
    if check:
        if len(str1)+len(str2) != len(str3):
            return False
    while p3 < len(str3):
        if p1 < len(str1) and str3[p3] == str1[p1]:
            if p2 < len(str2) and str3[p3] == str2[p2]:
                # does str3[p3] belong to str1 or str2?
                if isinter(str1[p1+1:], str2[p2:], str3[p3+1:], False):
                    return True
                if isinter(str1[p1:], str2[p2+1:], str3[p3+1:], False):
                    return True
                return False
            p1 += 1
        elif p2 < len(str2) and str3[p3] == str2[p2]:
            p2 += 1
        else:
            return False
        p3 += 1
    return p1 == len(str1) and p2 == len(str2) and p3 == len(str3)

Then I ran some benchmarks on random strings. This is the instrumentation (notice that it always generates valid shuffles, which may yield biased results):

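import random
import time

# isShuffle2 is not defined in this answer; it is presumably the cached version
# from the linked article. isinter is the function above.
for j in range(3, 50):
    # Build two random lowercase strings of up to j-1 characters each.
    str1 = ''
    str2 = ''
    for k in range(1, j):
        if random.choice([True, False]):
            str1 += chr(random.randint(97, 122))
        if random.choice([True, False]):
            str2 += chr(random.randint(97, 122))
    # Interleave them at random, so str3 is always a valid shuffle.
    p1 = 0
    p2 = 0
    str3 = ''
    while len(str3) < len(str1)+len(str2):
        if p1 < len(str1) and random.choice([True, False]):
            str3 += str1[p1]
            p1 += 1
        if p2 < len(str2) and random.choice([True, False]):
            str3 += str2[p2]
            p2 += 1
    # Seconds per million calls is the same number as microseconds per call.
    a = time.time()
    for i in range(1000000):
        isShuffle2(str1, str2, str3)
    a = (time.time() - a)
    b = time.time()
    for i in range(1000000):
        isinter(str1, str2, str3)
    b = (time.time() - b)

    print("(%s,%s = %s) in %f against %f us" % (str1, str2, str3, a, b))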

The results suggest that the cached+DP algorithm is more efficient for short strings. When strings get longer (more than 3-4 characters), the cached+DP algorithm starts losing ground. At around length 10, the algorithm above performs twice as fast as the totally-recursive, cached version.

The DP algorithm performs better, but still worse than the one above, if strings contain repeated characters (I did this by restricting the range from a-z to a-i) and if the overlap is slight. For example, in this case the DP loses by less than 2 us:

(cfccha,ddehhg = cfcchaddehhg) in 68.139601 against 66.826320 us

Not surprisingly, full overlap (one letter from each string in turn) sees the largest difference, with a ratio as high as 364:178 (a bit more than 2:1).
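For readers who want something to compare against, here is a rough sketch of what a cached recursive check can look like. It is not the code from the linked article and not the isShuffle2 used in the benchmark; the name `is_interleave_memo` is made up, and it assumes Python 3 (for `functools.lru_cache`) and strings short enough to stay within the recursion limit.

```python
from functools import lru_cache


def is_interleave_memo(str1, str2, str3):
    # Cheap rejection first: the lengths must add up exactly.
    if len(str1) + len(str2) != len(str3):
        return False

    @lru_cache(maxsize=None)
    def walk(p1, p2):
        # p1 and p2 letters are already consumed, so we are at str3[p1 + p2].
        p3 = p1 + p2
        if p3 == len(str3):
            return True
        # Try giving the current letter to str1, then fall back to str2.
        if p1 < len(str1) and str1[p1] == str3[p3] and walk(p1 + 1, p2):
            return True
        return p2 < len(str2) and str2[p2] == str3[p3] and walk(p1, p2 + 1)

    return walk(0, 0)


print(is_interleave_memo("ab", "ac", "acab"))  # True
print(is_interleave_memo("ab", "ac", "acbb"))  # False
```

The cache is keyed on the pair (p1, p2), so each state is evaluated at most once, at the cost of the bookkeeping that the uncached function above avoids.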

Why should it be more performant, when it is requiring auxiliary space? — SexyBeast, Sep 23 '12 at 22:29
@Cupidvogel: I thought it *might* be (actually I confess I believed it would be). I benchmarked, and I was wrong: it is twice as slow to confirm a match, and it varies (but always slower) to verify a mismatch. Using list indexes helps, but not much, as does popping lists from the *bottom*. — LSerni, Sep 23 '12 at 23:02
Thanks. That looks like a good solution. One question, while recursing, you check `str1[p1+1:], str2[p2:]` and `str1[p1:], str2[p2+1:]` in order, because that is the order in which the outer `if` conditionals are nested, right? In this case, the recursion would return `false` immediately for `str1`, since its 2nd letter is `b` while `str3`'s 2nd letter is `b`. However, `str1` and `str3` may continue to be similar for more than 1 subsequent letter before making a mismatch, thus showing that `str2` should have matched that letter instead of `str1`. Won't this return false positive? — SexyBeast, Sep 24 '12 at 09:03
No no, it's okay now. Thanks, I got it. Can you compare and contrast the efficiency of this method and the one in the linked article? Does this one need any kind of DP/caching? — SexyBeast, Sep 24 '12 at 11:31
@Cupidvogel, I ran some tests; probably due to implementation details (both algorithms have the same complexity), and I suspect cache penalty, this algorithm - which needs no caching - performs from "slightly worse" in a few select cases, to "appreciably faster" on average, up to "twice as fast". See answer for the sordid details. — LSerni, Sep 24 '12 at 19:56

score 0 · Answer 4 · answered Sep 24 '12 at 08:57

First, just an implementation point: I think you may get rid of the tests on the lengths of str1 and str2. In C, strings are terminated with nul characters, so this special character will never be found in str3. So just put p1++ if you find a correct character. But in Python I don't remember whether this holds... Sorry, I am not a serious Python user. What is the output of str1[p1] if p1==len(str1)?

In addition to this, as pointed out by Jirka Hanika, the output of your code is wrong. I have seen another situation where it fails: when a character is common to both substrings. Ex: if str1="abc", str2="dbe", then str3="adbec" contains both str1 and str2, but your algorithm fails on this case. The problem comes from the elif statement; put another if there instead.
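One way to read the suggestion above is sketched below, under the assumption that a single letter of str3 may count for both strings and that extra letters are allowed. The name `contains_both_in_order` is hypothetical, and the else/break branch of the original is dropped, since with two independent if tests a non-matching letter is simply skipped.

```python
def contains_both_in_order(str1, str2, str3):
    p1, p2 = 0, 0
    for ch in str3:
        # Two independent tests: the same letter may advance both pointers.
        if p1 < len(str1) and ch == str1[p1]:
            p1 += 1
        if p2 < len(str2) and ch == str2[p2]:
            p2 += 1
    return p1 == len(str1) and p2 == len(str2)


print(contains_both_in_order("abc", "dbe", "adbec"))  # True
print(contains_both_in_order("abc", "dbe", "adec"))   # False
```

This answers a looser question than the strict interleaving check, where every letter of str3 must come from exactly one of the two strings.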

The output of the code by LSerni seems to me to be correct.
