
I have a large list of substrings (more than 100,000) whose occurrences I need to count in a large string (a few hundred KB). The most common C++ approach I found online is:

// Counts non-overlapping occurrences of sub in str; an empty sub counts as 0.
size_t countSubstring(const std::string& str, const std::string& sub) {
    if (sub.length() == 0) return 0;
    size_t count = 0, l = sub.length();
    for (size_t offset = str.find(sub); offset != std::string::npos;
         offset = str.find(sub, offset + l))
    {
        ++count;
    }

    return count;
}

But it's too slow for my purposes. Is there any faster way to do that?

P.S. I also tried the KMP algorithm, but it is even slower.

std::vector<size_t> prefix_function(const std::string& s) {
    size_t n = s.length();
    // Zero-initialized, so pi[0] is already 0 and an empty s is handled safely.
    std::vector<size_t> pi(n);

    size_t j;
    for(size_t i=1; i<n; ++i) {
        j = pi[i-1];
        while(j>0 && s[i]!=s[j])
            j = pi[j-1];
        if(s[i]==s[j]) ++j;
        pi[i] = j;
    }
    return pi;
}


size_t count_using_KMP(const std::string& S, const std::string& pattern, size_t start) {
    if (pattern.empty()) return 0;  // same convention as countSubstring

    std::vector<size_t> prefix = prefix_function(pattern);

    size_t counter=0, l=pattern.length(), k=0;
    for(size_t i=start; i<S.length(); ++i) {
        while((k>0) && (pattern[k]!=S[i])) {
            k = prefix[k-1];
        }
        if(pattern[k]==S[i]) {
            k++;
        }
        if(k==l) {
            counter++;
            k=0;
        }
    }
    return counter;
}
EgorPuzyrev
  • Do the substrings have some interesting characteristics that can be exploited? – Vikhram Apr 05 '17 at 16:23
  • Just an idea: Did you try `strstr` on the `str.c_str()`? – mch Apr 05 '17 at 16:25
  • Using a text index is orders of magnitude faster, but you have to create the index first. That wouldn't be worthwhile for only a few kB. – mike Apr 05 '17 at 16:27
  • As a quick recap of answers from that other question: do not search for each substring individually. There are algorithms that search for all substrings in a single run and are much more efficient in such cases, like Aho-Corasick. – Andrey Turkin Apr 05 '17 at 16:28
  • What is the *actual* problem you want to solve by this solution? *Why* do you need to find every possible sub-string? Maybe we can help you with that problem instead? Propose other data-structures or algorithms? Also please take some time to read about [the XY problem](http://xyproblem.info/). – Some programmer dude Apr 05 '17 at 16:30
  • You should try representing the substrings as a [Trie](https://en.wikipedia.org/wiki/Trie) and then match the `Trie` against the main string. This way, you will not be repeating the substring search 100K times (once for each substring). Also, partial string indexing on the string to be searched will provide you some relief – Vikhram Apr 05 '17 at 16:33
  • @Vikhram no, just byte sequences. Representing the substrings as a Trie and then searching looks similar to pre-KMP. – EgorPuzyrev Apr 05 '17 at 17:19
  • @mch no, I didn't try, but this is not a solution anyway: neither string::find nor strstr is optimized for counting – EgorPuzyrev Apr 05 '17 at 17:19
  • @mike yep, but what kind of index? There are a lot of them: suffix/prefix trees/arrays and so on – EgorPuzyrev Apr 05 '17 at 17:19
  • @AndreyTurkin Thanks, looks good. Where can I find a list of algorithms of that sort (not only data structures)? – EgorPuzyrev Apr 05 '17 at 17:19
  • @Someprogrammerdude I don't need to find _every_ possible substring. I've already found the ones I need, and now I want to count their occurrences. That's the actual problem. – EgorPuzyrev Apr 05 '17 at 17:19
  • @EgorPuzyrev There are more, which have better performance guarantees (e.g. FM-Index). Just look at some succinct data structures/text indices and what their runtime and space complexity is. This is a state of the art C++ library: https://github.com/simongog/sdsl-lite – mike Apr 05 '17 at 17:29
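
Following up on the Aho-Corasick suggestion in the comments, here is a minimal sketch of how all substrings might be counted in a single pass over the text. The class name `SubstringCounter` and its `count` method are hypothetical, not taken from any library. It assumes the same convention as `countSubstring` in the question (non-overlapping occurrences, an empty substring counts as 0) and stores trie edges in a `std::map` per node to keep memory modest for arbitrary byte values; a real implementation would likely pick a faster edge representation.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical helper: an Aho-Corasick automaton over a fixed pattern list.
class SubstringCounter {
public:
    explicit SubstringCounter(const std::vector<std::string>& patterns) {
        nodes_.emplace_back();  // node 0 is the trie root
        for (const std::string& p : patterns) {
            int cur = 0;
            for (unsigned char c : p) {
                auto it = nodes_[cur].next.find(c);
                if (it != nodes_[cur].next.end()) {
                    cur = it->second;
                } else {
                    int created = static_cast<int>(nodes_.size());
                    nodes_[cur].next.emplace(c, created);
                    nodes_.emplace_back();
                    nodes_[created].depth = nodes_[cur].depth + 1;
                    cur = created;
                }
            }
            if (!p.empty()) nodes_[cur].isEnd = true;
            terminal_.push_back(cur);  // duplicates share one terminal node
        }
        buildLinks();
    }

    // Returns one count per input pattern, in input order.
    std::vector<std::size_t> count(const std::string& text) const {
        std::vector<std::size_t> hits(nodes_.size(), 0);
        // freeFrom[m]: first text index where the next match of m may start.
        std::vector<std::size_t> freeFrom(nodes_.size(), 0);
        int cur = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            cur = step(cur, static_cast<unsigned char>(text[i]));
            // Visit every pattern that ends at position i via output links.
            for (int m = nodes_[cur].isEnd ? cur : nodes_[cur].out; m != -1;
                 m = nodes_[m].out) {
                std::size_t start = i + 1 - nodes_[m].depth;
                if (start >= freeFrom[m]) {  // skip overlapping matches
                    ++hits[m];
                    freeFrom[m] = i + 1;
                }
            }
        }
        std::vector<std::size_t> result;
        result.reserve(terminal_.size());
        for (int t : terminal_) result.push_back(hits[t]);  // root stays 0
        return result;
    }

private:
    struct Node {
        std::map<unsigned char, int> next;  // trie edges
        int fail = 0;          // longest proper suffix that is also in the trie
        int out = -1;          // nearest pattern-ending node along fail links
        std::size_t depth = 0; // length of the string spelled by this node
        bool isEnd = false;
    };

    // Follow fail links until an edge labelled c exists (or the root is hit).
    int step(int state, unsigned char c) const {
        while (true) {
            auto it = nodes_[state].next.find(c);
            if (it != nodes_[state].next.end()) return it->second;
            if (state == 0) return 0;
            state = nodes_[state].fail;
        }
    }

    // Breadth-first pass so that shallower nodes are linked before deeper ones.
    void buildLinks() {
        std::queue<int> q;
        for (const auto& kv : nodes_[0].next) q.push(kv.second);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            for (const auto& kv : nodes_[u].next) {
                int v = kv.second;
                int f = step(nodes_[u].fail, kv.first);
                nodes_[v].fail = f;
                nodes_[v].out = nodes_[f].isEnd ? f : nodes_[f].out;
                q.push(v);
            }
        }
    }

    std::vector<Node> nodes_;
    std::vector<int> terminal_;  // terminal node index for each input pattern
};
```

Usage would look like `SubstringCounter counter(patterns); auto counts = counter.count(text);`, where `counts[i]` belongs to `patterns[i]`. The automaton is built once, so the text is scanned a single time regardless of how many substrings there are; dropping the `freeFrom` check would count overlapping occurrences instead.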

0 Answers