Optimisation ideas - Longest common substring

Question

I have this program which is supposed to find the Longest Common Substring of a number of strings. Which it does, but if the strings are very long (i.e. >8000 characters long), it works slowly (1.5 seconds). Is there any way to optimise that?

The program is this:

//#include "stdafx.h"
#include <iostream>
#include <string>
#include <vector>
#include <cassert>


using namespace std;

const unsigned short MAX_STRINGS = 10;
const unsigned int  MAX_SIZE=10000;
vector<string> strings;
unsigned int len;

string GetLongestCommonSubstring( string string1, string string2 );
inline void readNumberSubstrings();
inline const string getMaxSubstring();

void readNumberSubstrings()
{
    cin >> len;

    assert(len > 1 && len <=MAX_STRINGS);

    strings.resize(len);

    for(register unsigned int i=0; i<len;i++)
        strings[i]=string(MAX_SIZE,0);

    for(register unsigned int i=0; i<len; i++)
        cin>>strings[i];
}

 const string getMaxSubstring()
{
    string maxSubstring=strings[0];
    for(register unsigned int i=1; i < len; i++)
        maxSubstring=GetLongestCommonSubstring(maxSubstring, strings[i]);
    return maxSubstring;
}

string GetLongestCommonSubstring( string string1, string string2 ) 
{

    const int solution_size = string2.length()+ 1;

    int *x=new int[solution_size]();
    int *y= new int[solution_size]();

    int **previous = &x;
    int **current = &y;

    int max_length = 0;
    int result_index = 0;

    int j;
    int length;
    int M=string2.length() - 1;

    for(register int i = string1.length() - 1; i >= 0; i--)
    {
        for(register int j = M; j >= 0; j--) 
        {
            if(string1[i] != string2[j]) 
                (*current)[j] = 0;
            else 
            {
                length = 1 + (*previous)[j + 1];
                if (length > max_length)
                {
                    max_length = length;
                    result_index = i;
                }

                (*current)[j] = length;
            }
        }

        swap(previous, current);
    }
    string1[max_length+result_index]='\0';
    return &(string1[result_index]);
}

int main()
{
    readNumberSubstrings();
    cout << getMaxSubstring() << endl;
    return 0;
}

Note: there is a reason why I didn't write code that would solve this problem with suffix trees (they're large).

What about parallelization? You could easily use OpenMP for that. The only problem is that you have to crawl the table diagonally. — Radim Vansa, Sep 20 '13 at 11:36
The thought has occurred, but I'd like to optimize what I have first. — Chiffa, Sep 20 '13 at 11:40
Well, if I try it in a comment, it will be more than 9000 characters "too long". — Chiffa, Sep 20 '13 at 11:42
`register` is pointless with modern compilers and you don't need double indirection for `previous` and `current`. It probably won't help much with performance though. — zch, Sep 20 '13 at 11:43
@Chiffa a small input sample suffices. But without it, I can't even execute your code.. — stefan, Sep 20 '13 at 11:43
Perhaps, but for the moment I can't think of anything much better. — Chiffa, Sep 20 '13 at 11:45
Well, for example I input 10 random strings like this: 10 mrqmbmdztpposfyjawqnsgghziumfuwekudgvtwbmpvp qwikbdlefitrmbmdztpposfyjawqncjgvwqyurt kjehpbvoodpdwionigwjbczmddlomvmbmdztpposfyjawqnio hyhpqeambmdztpposfyjawqn tiewwnhvdtloibzmrgsywmbmdztpposfyjawqndcupawrytbakvayvc vbqqyorqepgzcggfkviyerjmbmdztpposfyjawqnb ysdxhmbmdztpposfyjawqnbopejbwokcnjabwrjxc mbmdztpposfyjawqn huznybnvcuzfqmbmdztpposfyjawqnofkonyskbsig xgokylmbmdztpposfyjawqnuqlfzreibpdcaevptrpxxdvv — Chiffa, Sep 20 '13 at 11:47
Anyway, its a longest common substring problem; sample inputs are easy to come up with. Just input something like 3 sample1 sample2 samplesample3 — Chiffa, Sep 20 '13 at 11:49
Won't make much of a difference, but you should pass the arguments to `GetLongestCommonSubstring` as `const string&` and not `string`. In C++ we pass objects by reference to avoid copying. — András Aszódi, Sep 20 '13 at 11:56

score 1 · Answer 1 · answered Sep 20 '13 at 12:31

Often when it comes to optimization, a different approach might be your only true option rather than trying to incrementally improve the current implementation. Here's my idea:

create a list of valid characters that might appear in the longest common substring. I.e., if a character doesn't appear in all strings, it can't be part of the longest common substring.
separate each string into multiple strings containing only valid characters
for every such string, create every possible substring and add it to the list as well
filter (as with the characters) all strings, that don't show up in all lists.

The complexity of this obviously depends largely on the number of invalid characters. if it's zero, this approach doesn't help at all.

Some remarks on your code: Don't try to be overly clever. The compiler will optimize so much, there's really no need for you to put register in your code. Second, your allocating strings and then overwrite them (in readNumberSubstrings), that's totally unnecessary. Third, pass by const reference if you can. Fourth, don't use raw pointers, especially if you never delete [] your new []d objects. Use std::vectors instead, it behaves well with exceptions (which you might encounter, you're using strings a lot!).

Thanks) Well, I just happen to like 'register'. And yes, I see how re-writing my code could solve my problems, but I'll try your advices first. — Chiffa, Sep 20 '13 at 12:54
I've modified my function which used to overwrite strings and am now passing `string&`s instead of `string`s, and I'm measuring the time it takes for the code to work using `std::chrono` but, unless I'm measuring something really wrong, it's still slow. — Chiffa, Sep 20 '13 at 19:12
To be more precise, it takes about 3 seconds to compare 2 very long strings, and more than 20 for 10. — Chiffa, Sep 20 '13 at 19:19

score 0 · Answer 2 · answered Sep 20 '13 at 11:52

0

You have to use suffix tree. This struct will make algorithm, which work about 1 second for 10 string with 10000 symbols.

answered Sep 20 '13 at 11:52

Dmitrey_s

1
1

Using a suffix tree takes a lot of memory, as far as I know. – Chiffa Sep 20 '13 at 12:03
It is, no secret there. Am I to infer that yandex expects me to use a suffix tree? – Chiffa Sep 20 '13 at 12:13
Do the developers at Yandex go around calling each other stupid and then never explaining their own code? I've only met one Yandex developer and that was what he was like. – PP. Sep 20 '13 at 12:19
I recommend to read "Algorithms on String, Trees, and Sequences. Computer Science and Computational Biology". The book has your problem. – Dmitrey_s Sep 20 '13 at 12:21
Well, I wouldn't know, not being a Yandex developer. But this is straying from the coding problem. – Chiffa Sep 20 '13 at 12:25

score 0 · Answer 3 · answered Dec 16 '13 at 08:31

Give a Suffix Arraya try, they take as much memory as your input strings (depending on your text encoding though) and a built quickly in linear time.

http://en.wikipedia.org/wiki/Suffix_array

Here is my JavaScript code for this

function LCS(as, bs, A, B) {
    var a = 0, b = 0, R = [], max = 1
    while (a < A.length && b < B.length) {
        var M = cmpAt(as, bs, A[a], B[b])
        if (M.size > 0) {
            if (M.ab < 0) {
                var x = b; while (x < B.length) {
                    var C = cmpAt(as, bs, A[a], B[x])
                    if (C.size >= M.size) {  if (C.size >= max) max = C.size, R.push([a, x, C.size]) } else break
                    x++
                }
            } else {
                var x = a; while (x < A.length) {
                    var C = cmpAt(as, bs, A[x], B[b])
                    if (C.size >= M.size) { if (C.size >= max) max = C.size, R.push([x, b, C.size]) } else break
                    x++
                }
            }
        }
        if (M.ab < 0) a++; else b++
    }
    R = R.filter(function(a){ if (a[2] == max) return true })
    return R
}

function cmpAt(a, b, x, y) {
    var c = 0
    while (true) {
        if (x == a.length) {
            if (y == b.length) return { size: c, ab: 0 }
            return { size: c, ab: -1 }
        }
        if (y == b.length) return { size: c, ab: 1 }
        if (a.charCodeAt(x) != b.charCodeAt(y)) {
            var ab = 1; 
            if (a.charCodeAt(x) < b.charCodeAt(y)) ab = -1
            return { size: c, ab: ab }
        }
        c++, x++, y++
    }
}

Optimisation ideas - Longest common substring

3 Answers3

Linked