Find number of occurrences of substring

Question

I'm working on a programming task. It is a simple one, but the time limit makes it a bit harder.

Find the number of occurrences of a substring. You will be given M, the length of the substring; the substring to find; N, the length of the base string; and the base string.
M <= 100 000
N<= 200 000

Input

10
budsvabbud
79
uaahskuskamikrofonubudsvabbudnebudlabutkspkspkspmusimriesitbudsvabbudsvabbudnel

Output
3

I tried using the built-in function find, but it wasn't fast enough:

#include<iostream>
#include<string>

using namespace std;

int main()
{
    int n;
    int occurrences = 0;
    string::size_type start = 0;
    string base_string, to_find;
    cin >> n >> to_find >> n >> base_string;
    while ((start = base_string.find(to_find, start)) != string::npos) {
        ++occurrences;
        ++start; // advance by one position so overlapping matches are also counted
    }
    cout << occurrences << endl;
}

So I tried to write my own function, but it was even slower:

#include<iostream>
#include<cstdio>
#include<string>
#include<queue>

using namespace std;

int main()
{
    int n, m;
    string to_find;
    queue<int> rada;  
    int occurrences = 0;
    cin >> m >> to_find >> n;
    for (int i = 0; i < n; i++)
    {
        char c;
        scanf(" %c", &c);
        int max = rada.size();
        for (int j = 0; j < max; j++)
        {
            int index = rada.front();
            rada.pop();
            if (c == to_find[index])  
            {
                if (++index == m) {
                    occurrences++;
                }
                else
                    rada.push(index);
            }
        }
        if (c == to_find[0])
        {
            if (1 == m)
                occurrences++; // a one-character pattern is fully matched right away
            else
                rada.push(1);
        }
    }
    cout << occurrences << endl;

}

I know some people did this in 0 ms, but my first code needs more than 2000 ms and the second one a lot more than that. Do you have any ideas how to solve this? Thanks.

EDIT: Limits of length:

M <= 100 000 - length of substring

N <= 200 000 - length of base string

@HumamHelfawi Sorry, I forgot to write it. I will edit my question. — Jozef Bugoš, Dec 14 '15 at 14:21
Have you enabled optimization? Do you really mean 2000 milliseconds - I'd be surprised at even a debug build taking that long. — Martin Bonner supports Monica, Dec 14 '15 at 14:25
For fast searching in large files, something like Boyer-Moore is probably an order of magnitude (or more) faster - but it's not likely to be worth it for 79 characters. — Martin Bonner supports Monica, Dec 14 '15 at 14:26
Yeah, I mean 2000 milliseconds. But not for input I gave as example. There could be input with length 200 000. — Jozef Bugoš, Dec 14 '15 at 14:30
Aahh. Boyer-Moore *is* going to be worth it for 200K characters - but I don't know if it works well for such huge search strings — Martin Bonner supports Monica, Dec 14 '15 at 14:32
It might be worth doing a Boyer-Moore search for the first 1000 characters, and then a brute-force comparison for the rest. — Martin Bonner supports Monica, Dec 14 '15 at 14:33

score 2 · Accepted Answer · answered Dec 14 '15 at 15:03

The algorithm you presented is O(M*N) in the worst case, where N is the length of the text and M is the length of the searched word. Library implementations of find are often naive as well. However, the Knuth-Morris-Pratt (KMP) algorithm does the search in O(M+N) time. See, e.g., the Wikipedia article on the Knuth-Morris-Pratt algorithm. There are also alternatives that might be easier to implement, such as Boyer-Moore-Horspool, although its worst case is still O(M*N).
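As a rough illustration of the KMP idea, here is a minimal sketch that counts occurrences, overlapping ones included. The function names `build_failure` and `count_occurrences` are made up for this example and do not come from any library; treat it as one possible way to write it, not a tuned solution for the judge.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Failure table: fail[i] is the length of the longest proper prefix of
// pattern[0..i] that is also a suffix of pattern[0..i].
std::vector<std::size_t> build_failure(const std::string& pattern)
{
    std::vector<std::size_t> fail(pattern.size(), 0);
    std::size_t k = 0; // length of the currently matched prefix
    for (std::size_t i = 1; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k])
            k = fail[k - 1];
        if (pattern[i] == pattern[k])
            ++k;
        fail[i] = k;
    }
    return fail;
}

// Counts occurrences of pattern in text in O(M+N), overlaps included.
std::size_t count_occurrences(const std::string& text, const std::string& pattern)
{
    if (pattern.empty() || pattern.size() > text.size())
        return 0;
    const std::vector<std::size_t> fail = build_failure(pattern);
    std::size_t count = 0;
    std::size_t k = 0; // number of pattern characters matched so far
    for (char c : text) {
        while (k > 0 && c != pattern[k])
            k = fail[k - 1];
        if (c == pattern[k])
            ++k;
        if (k == pattern.size()) {
            ++count;
            k = fail[k - 1]; // fall back so an overlapping match can follow
        }
    }
    return count;
}
```

Reading the two strings as in the question and printing `count_occurrences(base_string, to_find)` should give 3 for the sample input. The text is never re-read after a mismatch, which is where the O(M+N) bound comes from.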

Ari Hietanen

This. If the string you're searching in is 200k of `aaaa...` and the string you're searching for is 100k of `aaaa...b` (99999 a's then 1 b), then the cost of your search is on the order of 200000 * 100000 comparisons, which is huge. This is because, with any naive implementation, you try to match your search string at position 1, match 99999 a's, and then fail to match the b. You then move to position 2 and repeat the failure. At each position in the searched string you have to do 99999 matches. KMP solves this problem and has complexity O(m+n) instead of O(m * n). – Mike Vine Dec 14 '15 at 15:12
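To make that worst case concrete, the following sketch counts the character comparisons a straightforward search performs. `naive_comparisons` is a hypothetical helper written for this illustration; it models a plain try-every-position search, not the actual internals of any particular `std::string::find` implementation.

```cpp
#include <cstddef>
#include <string>

// Counts character comparisons made by a straightforward search that tries
// every start position and compares until the first mismatch.
std::size_t naive_comparisons(const std::string& text, const std::string& pattern)
{
    std::size_t comparisons = 0;
    if (pattern.empty() || pattern.size() > text.size())
        return comparisons;
    for (std::size_t start = 0; start + pattern.size() <= text.size(); ++start) {
        for (std::size_t j = 0; j < pattern.size(); ++j) {
            ++comparisons;
            if (text[start + j] != pattern[j])
                break; // give up on this start position
        }
    }
    return comparisons;
}
```

For a text of N a's and a pattern of M-1 a's followed by one b, every start position costs M comparisons, so the total is (N - M + 1) * M. With a scaled-down N = 2000 and M = 1000, `naive_comparisons(std::string(2000, 'a'), std::string(999, 'a') + 'b')` comes to 1,001,000; at the limits from the question the same formula gives roughly 10^10.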

Neijwiert · Answer 2 · 2015-12-14T14:59:11.657

Safe version

#include <cstring>   // strlen
#include <stdexcept> // std::runtime_error

static size_t findOccurences(const char * const aInput, const char * const aDelim)
{
    if (aInput == nullptr || aDelim == nullptr)
    {
        throw std::runtime_error("Argument(s) null");
    }

    const size_t inputLength = strlen(aInput);
    const size_t delimLength = strlen(aDelim);

    size_t result = 0;

    if (delimLength <= inputLength && delimLength > 0)
    {
        size_t delimIndex = 0;

        for (size_t inputIndex = 0; inputIndex < inputLength; inputIndex++)
        {
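            if (aInput[inputIndex] != aDelim[delimIndex])
            {
                // Mismatch: drop the partial match. The current character is not
                // re-tested, so e.g. "ab" in "aab" is not counted.
                delimIndex = 0;
            }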
            else
            {
                delimIndex++;

                if (delimIndex == delimLength)
                {
                    delimIndex = 0;
                    result++;
                }
            }
        }
    }

    return result;
}

Unsafe version

static size_t unsafeFindOccurences(const char * const aInput, const char * const aDelim)
{
    const size_t inputLength = strlen(aInput);
    const size_t delimLength = strlen(aDelim);

    size_t result = 0;
    size_t delimIndex = 0;

    for (size_t inputIndex = 0; inputIndex < inputLength; inputIndex++)
    {
        if (aInput[inputIndex] != aDelim[delimIndex])
        {
            // Mismatch: drop the partial match without re-testing this character.
            delimIndex = 0;
        }
        else
        {
            delimIndex++;

            if (delimIndex == delimLength)
            {
                delimIndex = 0;
                result++;
            }
        }
    }

    return result;
}

Results safe

          x86        x64
Debug     5501ms     5813ms
Release   3889ms     3998ms

Results unsafe

          x86        x64
Debug     5442ms     5564ms
Release   3074ms     3139ms

Compiled with Visual Studio 2015, Visual Studio 2015 (v140) toolset under Windows 10 x64 Pro.

Using this input, searching for 'ad' with 1,000,000 iterations.

score 0 · Answer 3 · answered Dec 14 '15 at 14:33

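I tried this code in debug mode without any optimization and it took 11 ms (VS.NET 2013, Intel Core i7):

#include <cstdio>   // std::getchar
#include <ctime>    // clock, CLOCKS_PER_SEC
#include <iostream>
#include <string>

using namespace std;

int main()
{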
    int occurrences = 0;
    string::size_type start = 0;
    string base_string, to_find;
    base_string.reserve(200000);
    to_find.reserve(100000);
    for (size_t i = 0; i < 100000; i++){
        base_string.push_back('a');
    }
    for (size_t i = 0; i < 100000; i++){
        base_string.push_back('b');
    }
    for (size_t i = 0; i < 100000; i++){
        to_find.push_back('b');
    }
    auto start_s = clock();
    while ((start = base_string.find(to_find, start)) != string::npos) {
        ++occurrences;
        ++start; // advance by one position so overlapping matches are also counted
    }
    auto stop_s = clock();
    std::cout << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000 << " ms\n";
    cout << occurrences << endl;
    std::getchar();
}

There is a problem in your compiler, configuration, or machine, but not in your code.

This is a task from a website, and I submitted this code to be checked there as well. I just got the result, Time limit exceeded, along with the time my code needed: 1.in OK 0 ms, 2.in OK 0 ms, 3.in OK 0 ms, 4.in OK 1 ms, 5.in OK 164 ms, 6.in OK 45 ms, 7.in TLE 1811 ms — Jozef Bugoš, Dec 14 '15 at 14:38
Try making `base_string` 200,000 'a's and the `to_find` string 99,999 'a's and then a single 'b'. Then remeasure... — Mike Vine, Dec 14 '15 at 15:17
