0

I have a small problem. I'm solving one programming task, but have a problem with it. It is simple one, but time limit make it a bit harder.

Find number of occurrences of substring. You will be given M - length of substring; substring to find, N - length of base string; base string.
M <= 100 000
N<= 200 000

Input

10
budsvabbud
79
uaahskuskamikrofonubudsvabbudnebudlabutkspkspkspmusimriesitbudsvabbudsvabbudnel

Output
3

I tried to use using build-in function find,but it wasn't fast enough:

#include<iostream>
#include<string>

using namespace std;

int main()
{
    int n;
    int occurrences = 0;
    string::size_type start = 0;
    string base_string, to_find;
    cin >> n >> to_find >> n >> base_string;
    while ((start = base_string.find(to_find, start)) != string::npos) {
        ++occurrences;
        start++;; // see the note
    }
    cout << occurrences << endl;
}

So I tried to write my own function, but it was even slower:

#include<iostream>
#include<cstdio>
#include<string>
#include<queue>

using namespace std;

int main()
{
    int n, m;
    string to_find;
    queue<int> rada;  
    int occurrences = 0;
    cin >> m >> to_find >> n;
    for (int i = 0; i < n; i++)
    {
        char c;
        scanf(" %c", &c);
        int max = rada.size();
        for (int j = 0; j < max; j++)
        {
            int index = rada.front();
            rada.pop();
            if (c == to_find[index])  
            {
                if (++index == m) {
                    occurrences++;
                }
                else
                    rada.push(index);
            }
        }
        if (c == to_find[0])
        {
            if (1 == m)
                n++;
            else
                rada.push(1);
        }
    }
    cout << occurrences << endl;

}

I know some people did this in 0 ms, but my first code needs more than 2000 ms and the second one a lot more than that. Have you any ideas how to solve this? Thanks.

EDIT: Limits of length:

M <= 100 000 - length of substring

N<= 200 000 - lenght of base string

Jozef Bugoš
  • 137
  • 1
  • 2
  • 8

3 Answers3

2

The algorithm you presented is an O(M*N), where N is the length of the text and M is the length of the searched world. Usually, also the libraries implement the naive algorithm. However, there is an algorithm by Knuth, Morrison and Pratt, which does it in a O(M+N) time. See, e.g., Wikipedia Knuth-Morrison-Pratt Algorithm. It has some variations which might be easier to implement like Boyer-Moore-Horsepool.

Ari Hietanen
  • 1,749
  • 13
  • 15
  • This. If the string you're searching in is 200k of `aaaa...` and the string you're searching for is 100k of `aaaa...b` (99999 a's then 1 b) then the complexity of your search is going to be 200000 * 100000 which is huge. This is because, using any naive implementation, you'll try to match your search string at position 1, then matching 99999 a's then fail to match a b. You then move to position 2 and repeart the failure. At each position in the search string you have to do 99999 matches. KMP solves this problem nad has a compexity O(m+n) instead of O(m * n). – Mike Vine Dec 14 '15 at 15:12
1

Safe version

static size_t findOccurences(const char * const aInput, const char * const aDelim)
{
    if (aInput == 0x0 || aDelim == 0x0)
    {
        throw std::runtime_error("Argument(s) null");
    }

    const size_t inputLength = strlen(aInput);
    const size_t delimLength = strlen(aDelim);

    size_t result = 0;

    if (delimLength <= inputLength && delimLength > 0)
    {
        size_t delimIndex = 0;

        for (size_t inputIndex = 0; inputIndex < inputLength; inputIndex++)
        {
            if (aInput[inputIndex] != aDelim[delimIndex])
            {
                delimIndex = 0;
            }
            else
            {
                delimIndex++;

                if (delimIndex == delimLength)
                {
                    delimIndex = 0;
                    result++;
                }
            }
        }
    }

    return result;
}

Unsafe version

static size_t unsafeFindOccurences(const char * const aInput, const char * const aDelim)
{
    const size_t inputLength = strlen(aInput);
    const size_t delimLength = strlen(aDelim);

    size_t result = 0;
    size_t delimIndex = 0;

    for (size_t inputIndex = 0; inputIndex < inputLength; inputIndex++)
    {
        if (aInput[inputIndex] != aDelim[delimIndex])
        {
            delimIndex = 0;
        }
        else
        {
            delimIndex++;

            if (delimIndex == delimLength)
            {
                delimIndex = 0;
                result++;
            }
        }
    }

    return result;
}

Results safe

          x86        x64
Debug     5501ms     5813ms
Release   3889ms     3998ms

Results unsafe

          x86        x64
Debug     5442ms     5564ms
Release   3074ms     3139ms

Compiled with Visual Studio 2015, Visual Studio 2015 (v140) toolset under Windows 10 x64 Pro.

Using this input. Searching for 'ad' and 1.000.000 iterations.

Neijwiert
  • 985
  • 6
  • 19
0

I try this code in debug mode without any optimization and it took 11 mSec. VS.NET 2013 , Intel Core i7:

int main()
{
    int n;
    int occurrences = 0;
    string::size_type start = 0;
    string base_string, to_find;
    base_string.reserve(200000);
    to_find.reserve(100000);
    for (size_t i = 0; i < 100000; i++){
        base_string.push_back('a');
    }
    for (size_t i = 0; i < 100000; i++){
        base_string.push_back('b');
    }
    for (size_t i = 0; i < 100000; i++){
        to_find.push_back('b');
    }
    auto start_s = clock();
    while ((start = base_string.find(to_find, start)) != string::npos) {
        ++occurrences;
        start++;; // see the note
    }
    auto stop_s = clock();
    std::cout << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000;
    cout << occurrences << endl;
    std::getchar();
}

There is a problem in compiler, configuration, your machine, but in your code.

Humam Helfawi
  • 19,566
  • 15
  • 85
  • 160
  • This is task from website and I sent this code to be checked there as well. I just got a result - Time limit exceeded - and a time which my code needed :1.in OK 0 ms 2.in OK 0 ms 3.in OK 0 ms 4.in OK 1 ms 5.in OK 164 ms 6.in OK 45 ms 7.in TLE 1811 ms – Jozef Bugoš Dec 14 '15 at 14:38
  • Try making `base_string` 200,000 'a's and the `to_find` string 99,999 'a's and then a single 'b'. Then remeasure... – Mike Vine Dec 14 '15 at 15:17
  • I did this.. and it went from 01 to 14440 – Mike Vine Dec 14 '15 at 15:21