
Can anybody help me optimize my LONGEST COMMON SUBSTRING implementation? I have to read really big files (up to 2 GB), but I can't figure out which data structure to use. C++ has no hash maps. TBB has a concurrent hash map, but it is very complicated to use with this algorithm. I have solved the problem with a `size_t **L` matrix, but it is memory-hungry and cannot be used for large inputs. The matrix is mostly zeros; that could be eliminated by, for example, using a `map`-based structure that stores only the non-zero cells, but that is really slow and practically unusable. Speed is very important. Here is the code:

    // L[i][j] will contain the length of the longest common substring
    // ending by positions i in refSeq and j in otherSeq
    size_t **L = new size_t*[refSeq.length()];
    for(size_t i=0; i<refSeq.length();++i)
        L[i] = new size_t[otherSeq.length()];

    // iteration over the characters of the reference sequence
    for(size_t i=0; i<refSeq.length();i++){
        // iteration over the characters of the sequence to compare
        for(size_t j=0; j<otherSeq.length();j++){
            // if the characters are the same,
            // increase the consecutive matching score from the previous cell
            if(refSeq[i]==otherSeq[j]){
                if(i==0 || j==0)
                    L[i][j]=1;
                else
                    L[i][j] = L[i-1][j-1] + 1;
            }
            // or reset the matching score to 0
            else
                L[i][j]=0;
        }
    }

    // output the matches for this sequence
    // length must be at least minMatchLength
    // and the longest possible.
    for(size_t i=0; i<refSeq.length();i++){
        for(size_t j=0; j<otherSeq.length();j++){

            if(L[i][j]>=minMatchLength) {
                //this sequence is part of a longer one
                if(i+1<refSeq.length() && j+1<otherSeq.length() && L[i][j]<=L[i+1][j+1])
                    continue;
                //this sequence is part of a longer one
                if(i<refSeq.length() && j+1<otherSeq.length() && L[i][j]<=L[i][j+1])
                    continue;
                //this sequence is part of a longer one
                if(i+1<refSeq.length() && j<otherSeq.length() && L[i][j]<=L[i+1][j])
                    continue;
                cout << i-L[i][j]+2 << " " << i+1 << " " << j-L[i][j]+2 << " " << j+1 << "\n";

                // output the matching sequences for debugging :
                //cout << refSeq.substr(i-L[i][j]+1,L[i][j]) << "\n";
                //cout << otherSeq.substr(j-L[i][j]+1,L[i][j]) << "\n";
            }
        }
    }
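
One observation about the code above: each cell `L[i][j]` only reads `L[i-1][j-1]`, so the full matrix is not needed if only the single longest match is wanted. The following is a minimal sketch under that assumption, keeping two rows instead of the whole table. The `LcsResult` struct and the `longestCommonSubstring` function are hypothetical names, not part of the original code, and the sketch does not reproduce the `minMatchLength` filtering of the output loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: length of the longest common substring and the
// 0-based index of its last character in each input.
struct LcsResult {
    std::size_t length;
    std::size_t refEnd;
    std::size_t otherEnd;
};

LcsResult longestCommonSubstring(const std::string& refSeq,
                                 const std::string& otherSeq) {
    LcsResult best;
    best.length = 0;
    best.refEnd = 0;
    best.otherEnd = 0;

    // prev holds row i-1 of the original L matrix, curr holds row i.
    std::vector<std::size_t> prev(otherSeq.size(), 0);
    std::vector<std::size_t> curr(otherSeq.size(), 0);

    for (std::size_t i = 0; i < refSeq.size(); ++i) {
        for (std::size_t j = 0; j < otherSeq.size(); ++j) {
            if (refSeq[i] == otherSeq[j]) {
                // Extend the run ending at (i-1, j-1); for i == 0 the
                // previous row is all zeros, so the run starts at 1.
                curr[j] = (j == 0) ? 1 : prev[j - 1] + 1;
                if (curr[j] > best.length) {
                    best.length = curr[j];
                    best.refEnd = i;
                    best.otherEnd = j;
                }
            } else {
                curr[j] = 0;  // mismatch resets the run
            }
        }
        std::swap(prev, curr);  // row i becomes the "previous" row
    }
    return best;
}
```

This reduces memory from O(n·m) to O(m), but the running time is still O(n·m), which is unlikely to be practical for inputs in the gigabyte range. Suffix-based structures (suffix arrays or suffix automata) are the usual way to get closer to linear time.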
vanste25
  • My, my. No hashmaps in c++? [Surprising that is](http://stdcxx.apache.org/doc/stdlibref/map.html) – Voo May 07 '12 at 15:36
  • OK, then tell me what that structure is called. There is unordered_map in C++0x, but I want a C++ structure. – vanste25 May 07 '12 at 16:00
  • Sorry, I didn't know that the newest C++ standard no longer counted as C++. Although every compiler I know of has offered hash maps for several years now. – Voo May 07 '12 at 17:17
  • It counts, but I don't use it. Map is a slow structure with ordered elements in it. I need something really fast. – vanste25 May 07 '12 at 18:51
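
To make the hash map discussion above concrete, here is a rough sketch of how the non-zero cells could be stored in a `std::unordered_map` (available since C++11). The names `SparseTable`, `cellKey` and `buildSparseTable` are hypothetical, and the key packing assumes that `refSeq.size() * otherSeq.size()` fits in 64 bits. Whether this is fast enough in practice is not something the sketch demonstrates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical sparse replacement for the L matrix: only non-zero cells
// are stored, keyed by a packed (i, j) index.
using SparseTable = std::unordered_map<std::uint64_t, std::size_t>;

// Pack cell (i, j) into one 64-bit key. Assumes i * width + j fits in 64 bits.
inline std::uint64_t cellKey(std::size_t i, std::size_t j, std::size_t width) {
    return static_cast<std::uint64_t>(i) * width + j;
}

SparseTable buildSparseTable(const std::string& refSeq,
                             const std::string& otherSeq) {
    SparseTable table;
    const std::size_t width = otherSeq.size();

    for (std::size_t i = 0; i < refSeq.size(); ++i) {
        for (std::size_t j = 0; j < otherSeq.size(); ++j) {
            if (refSeq[i] != otherSeq[j])
                continue;  // zero cells are simply never stored

            std::size_t run = 1;
            if (i > 0 && j > 0) {
                // A missing diagonal neighbour means its value was zero.
                SparseTable::const_iterator it =
                    table.find(cellKey(i - 1, j - 1, width));
                if (it != table.end())
                    run = it->second + 1;
            }
            table.emplace(cellKey(i, j, width), run);
        }
    }
    return table;
}
```

Note that this only saves memory for the zero cells. Every (i, j) pair is still visited, so the running time stays O(n·m), with hashing overhead on top.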

1 Answer


There is an Intel contest on the same problem.

Maybe they will post some solutions when it's over.

http://software.intel.com/fr-fr/articles/AYC-early2012_home/