Find the length of the longest common substring in 'n' binary strings

Question

I am given n strings (n>=2 and n<=4) and each one is constructed using 2 letters only: a and b. In this set of strings I have to find the length of the longest common substring that is present in all the strings. A solution is guaranteed to exist. Let's see an example:

n=4
abbabaaaaabb
aaaababab
bbbbaaaab
aaaaaaabaaab

The result is 5 (because the longest common substring is "aaaab").

I don't have to print (or even know) the substring, I just need to print its length.

It is also given that the result cannot be greater than 60, even though the length of each string can be as high as 13 000.

What I tried is this: I found the smallest length of any string of the given strings and then I compared it with 60 and I chose the smallest value between the two as starting point. Then I started taking sequences of the first string, and the length of each sequence of the first string is len, where len takes values from starting point to 1. At each iteration I take all possible sequences of the first string of length len and I use it as pattern. Using the KMP algorithm (thus, complexity of O(n+m)), I iterated through all the other strings (from 2 to n) and checked if pattern is found in string i. Whenever it isn't found, I break the iteration and try the next sequence available of length len or, if there isn't any, I decrease len and try all sequences that have as length the new, decreased value len. But if it matches, I stop the program and print the length len, since we started from the longest possible length, decreasing at each step, it is logical that the first match that we find represents the largest possible length. Here is the code (but it doesn't really matter since this method is not good enough; I know I shouldn't use using namespace std but it doesn't really affect this program so I just didn't bother):

#include <iostream>
#include <string>
#define nmax 50001
#define result_max 60

using namespace std;

int n,m,lps[nmax],starting_point,len;
string a[nmax],pattern,str;

void create_lps() {
    lps[0]=0;
    unsigned int len=0,i=1;
    while (i < pattern.length()) {
        if (pattern[i] == pattern[len]) {
            len++;
            lps[i] = len;
            i++;
        }
        else {
            if (len != 0) {
                len = lps[len-1];
            }
            else {
                lps[i] = 0;
                i++;
            }
        }
    }
}

bool kmp_MatchOrNot(int index) {
    unsigned int i=0,j=0;
    while (i < a[index].length()) {
        if (pattern[j] == a[index][i]) {
            j++;
            i++;
        }
        if (j == pattern.length()) {
            return true;
        }
        else if (i<a[index].length() && pattern[j]!=a[index][i]){
            if (j != 0) {
                j = lps[j-1];
            }
            else {
                i++;
            }
        }
    }
    return false;
}

int main()
{
    int i,left,n;
    unsigned int minim = nmax;
    bool solution;
    cin>>n;
    for (i=1;i<=n;i++) {
        cin>>a[i];
        if (a[i].length() < minim) {
            minim = a[i].length();
        }
    }

    if (minim < result_max) starting_point = minim;
    else starting_point = result_max;

    for (len=starting_point; len>=1; len--) {
        for (left=0; (unsigned)left<=a[1].length()-len; left++) {
            pattern = a[1].substr(left,len);
            solution = true;
            for (i=2;i<=n;i++) {
                if (pattern.length() > a[i].length()) {
                    solution = false;
                    break;
                }
                else {
                    create_lps();
                    if (kmp_MatchOrNot(i) == false) {
                        solution = false;
                        break;
                    }
                }
            }
            if (solution == true) {
                cout<<len;
                return 0;
            }
        }
    }
    return 0;
}

The thing is this: the program works correctly and it gives the right results, but when I sent the code on the website, it gave a 'Time limit exceeded' error, so I only got half the points.

This leads me to believe that, in order to solve the problem in a better time complexity, I have to take advantage of the fact that the letters of the string can only be a or b, since it looks like a pretty big thing that I didn't use, but I don't have any idea as to how exactly could I use this information. I would appreciate any help.

From a quick glance your code has `for{for{for}}` which seems to be roughly **O(lgn^2)**. This problem has a quite efficient solution, near **O(n)** You should see [this](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm). Its a classic algorithm for this problem — 138, Oct 02 '18 at 22:10
@user3386109 You are right, I typed without checking. The result of the example is indeed 5. Sorry. — , Oct 02 '18 at 22:48
@138 Well, that's the algorithm that I used, KMP (the one you linked). But this isn't just searching and checking. I have to find the longest common substring between N strings, so it's a lot more work to be done. The first 'for' sets the length of the current pattern(of the first string); the second 'for' selects where the current pattern (of the first string) starts. I do this in order to get the actual pattern and build the 'lps[]' array. The 3rd 'for' checks if the current pattern(of the first string) matches or not with all the other strings(from 2 to n). I don't see how I could cut time. — , Oct 02 '18 at 22:56
I think this can be done with a [trie](https://en.wikipedia.org/wiki/Trie). Use the shortest string to build the trie. Then process the other strings, marking nodes that have been visited, but not adding any new nodes. Finally, traverse the trie. The depth of the deepest node that was visited by all strings is the answer. — user3386109, Oct 03 '18 at 00:23
@user3386109 Can you clarify on "Then process the other strings, marking nodes that have been visited, but not adding any new nodes"? How should I do that? Wouldn't that check chunks of the strings that are on the same positions? And what do you mean by "Use the shortest string to build the trie"? What's the point of building a trie using only 1 string? These might be stupid questions, but I never worked with tries before so I would appreciate it if you could clarify a bit. — , Oct 03 '18 at 21:19
Let's wait and see if you can figure out the suffix tree solution. Given strings of length 13000, the worst case trie will have 600000 nodes and take about 12MB of memory (on a 64-bit processor). But the suffix tree should only need 26000 nodes. However, the answer by btilly says you'll need four suffix trees, which increases the node count to 104000. So the suffix tree solution definitely takes less memory. Whether it takes less time depends on how complicated it is to compare to suffix trees. — user3386109, Oct 03 '18 at 21:33
@user3386109 Without memoization, the cost to compare suffix trees is similar to comparing tries. With memoization, it is strictly faster. The suffix tree data structure works out to be what the trie data structure would be if you reused parts of the data structure that happen to be identical, so the solutions are conceptually the same, though the suffix tree is the definite winner. — btilly, Oct 03 '18 at 23:23

btilly · Accepted Answer · 2018-10-03T23:19:33.967

The answer is to build the suffix trees of all of the strings individually, then intersect them. A suffix tree is like a trie that contains all suffixes of one string simultaneously.

Building a suffix tree for a fixed alphabet is O(n) with Ukkonen's algorithm. (If you don't like that explanation, you can use google to find others.) If you have m trees of size n, this is time O(nm).

Intersecting suffix trees is a question of traversing them in parallel, only going further when you can go further in all trees. If you have m trees of size n, this operation can be done in time no more than O(nm).

The overall time of this algorithm is time O(nm). Given that just reading the strings is of time O(nm), you can't do better than that.

Adding a small amount of detail, suppose that your suffix tree is written as one character per node. So each node is just a dictionary whose keys are characters and whose values are the rest of the tree. So to us your example, for the string ABABA the diagram at https://i.stack.imgur.com/bzDRn.jpg would turn into a data structure something like (see below) this one:

{
    'A': {
        'B': {
            '': None,
            'A': {
                'B': {
                    '': None
                }
            }
        }
    },
    'B': {
        '': None
        'A': {
            'B': {
                '': None
            }
        }
    }
}

And likewise BABA would turn into:

{
    'A': {
        '': None
        'B': {
            'A': {
                '': None
            }
        }
    },
    'B': {
        'A': {
            '': None,
            'B': {
                'A': {
                    '': None
                }
            }
        }
    }
}

With data structures that look like this, naive Python to compare them looks like:

def tree_intersection_depth (trees):
    best_depth = 0
    for (char, deeper) in trees[0].items():
        if deeper is None:
            continue
        failed = False

        deepers = [deeper]
        for tree in trees[1:]:
            if char in tree:
                deepers.append(tree[char])
            else:
                failed = True
                break

        if failed:
            continue

        depth = 1 + tree_intersection_depth(deepers)
        if best_depth < depth:
            best_depth = depth

    return best_depth

And you would call it like tree_intersection_depth([tree1, tree2, tree3, ...]).

With the above two trees it does indeed give 3 as the answer.

Now I actually cheated in writing out that data structure. What makes suffix trees efficient is that you DON'T actually have a data structure that looks like that. You have one that reuses all of the repeated structures. So the code to simulate setting up the data structures and calling it looks like this:

b_ = {'B': {'': None}}
ab_ = {'': None, 'A': b_}
bab_ = {'B': ab_}
abab = {'A': bab_, 'B': ab_}

a_ = {'A': {'': None}}
ba_ = {'': None, 'B': a_}
aba_ = {'A': ba_}
baba = {'B': aba_, 'A': ba_}

print(tree_intersection_depth([abab, baba]))

And now we can see that to get the promised performance, there is a missing step. The problem is that while the size of the tree is O(n), while searching it we will potentially visit O(n^2) substrings. In your case you don't have to worry about that, because the substrings are guaranteed to never go to depth more than 60. But in the fully general case you would need to add memoization so that when recursion results in comparing data structures that you have seen before, you immediately return the old answer, and not the new one. (In Python you would use the id() method to compare the address of the object to ones you've seen before. In C++, have a set of pointer tuples for the same purpose.)

I'm sorry but I still don't really understand what you recommend doing, since I've never worked with suffix trees before. So I read about them and understood how to build them, but I don't really understand what to do afterwards. What do you mean by "traversing them in parallel, only going further when you can go further in all trees", what exactly should I do? Take this example: 2 strings: "ABAB" and "BABA". The 1st one has $0 as ending character and the 2nd has $1 as ending character. So the suffix trees would look like this: https://imgur.com/a/tnVlSI1 . How do I get as answer 3? — , Oct 03 '18 at 21:15
@BogdanVlad Recursively try to extend matches of possible strings to all trees in all possible ways, starting with the empty string matching at the root. Your full search of patterns that match should wind up being this: `(empty)`, `A`, `AB`, `ABA`, `B`, `BA`, `BAB`. That is the set of all substrings that appear in both. Two of them have length 3, so that is the answer. — btilly, Oct 03 '18 at 21:56
So I bassically take all the substrings of the first string and search that substring in the rest of the strings (from 2 to n)? Sorry if I'm being annoying but I really want to understand this. — , Oct 03 '18 at 22:11
@BogdanVlad Explaining it was far too much for a comment, so I updated my answer with working Python data structures and code to get the mechanics of the tree comparison across. Hopefully that helps you. — btilly, Oct 03 '18 at 23:20

Find the length of the longest common substring in 'n' binary strings

1 Answers1