
I have been trying to understand the KMP algorithm, but I still don't have a clear understanding of the reasoning behind it. Suppose my text is bacbababaabcbab and my pattern is abababca. Using the rule "length of the longest proper prefix of each pattern prefix that is also a proper suffix of that prefix", I filled my table[]:

a b a b a b c a
0 0 1 2 3 4 0 1

Now I started applying KMP algorithm on the text with my pattern and table.

After reaching index 4 of the text above, we have a match of length l = 5; looking at table[l-1] = table[4] = 3, the KMP algorithm says we can shift the pattern by 5 - 3 = 2 characters and continue.

bacbababaabcbab
    |||||x          <- ababa matched at indices 4..8, mismatch at index 9
      abababca      <- after shifting by 2, comparison resumes at pattern index 3

Here I am not getting the logic behind this shifting. Why are we allowed to shift by that amount? Can somebody please clarify my confusion?

Imposter
Riding Cave
  • Probably relevant: http://stackoverflow.com/q/8454625/20270 – Hasturkun Sep 14 '13 at 14:08
  • 1
    This question appears to be off-topic because it is more suited to http://cs.stackexchange.com/ since it is asking about an algorithm rather than a specific implementation. – Raymond Chen Sep 14 '13 at 19:38

2 Answers


To understand the logic behind the KMP algorithm, you should first understand how KMP improves on the brute-force algorithm.

Idea

After a shift of the pattern, the naive algorithm has forgotten all information about previously matched symbols. So it is possible that it re-compares a text symbol with different pattern symbols again and again. This leads to its worst case complexity of Θ(nm) (n: length of the text, m: length of the pattern).

The algorithm of Knuth, Morris and Pratt [KMP 77] makes use of the information gained by previous symbol comparisons. It never re-compares a text symbol that has matched a pattern symbol. As a result, the complexity of the searching phase of the Knuth-Morris-Pratt algorithm is in O(n).

However, a preprocessing of the pattern is necessary in order to analyze its structure. The preprocessing phase has a complexity of O(m). Since m<=n, the overall complexity of the Knuth-Morris-Pratt algorithm is in O(n).

text: bacbababaabcbab, pattern: abababca

In the brute-force method, we slide the pattern over the text one position at a time and check for a match at each position. If a match is found, we again slide by 1 to check for subsequent matches.

#include <stdio.h>
#include <string.h>

void search(char *pat, char *txt)
{
    int M = strlen(pat);
    int N = strlen(txt);

    /* A loop to slide pat[] one by one */
    for (int i = 0; i <= N - M; i++)
    {
        int j;

        /* For current index i, check for pattern match */
        for (j = 0; j < M; j++)
        {
            if (txt[i+j] != pat[j])
                break;
        }
        if (j == M)  // if pat[0...M-1] = txt[i, i+1, ...i+M-1]
        {
           printf("Pattern found at index %d \n", i);
        }
    }
}

The complexity of the above algorithm is O(nm). Notice that it never reuses the comparison data it has already gathered:

bacbababaabcbab   // let i be the iterating variable over this text

abababca          // let j be the iterating variable over the pattern

When i = 0 and j = 0 there is a mismatch (text[i+j] != pat[j]), so we increment i until there is a match. When i = 4 there is a match (text[i+j] == pat[j]), so we increment j until we find a mismatch (if j reaches the pattern length, we have found the pattern). In the given example we find a mismatch at j = 5 when i = 4, i.e. the mismatch happens at index 4 + 5 = 9 in the text. The substring that matched is ababa.

  • Why we need to choose the longest proper prefix which is also a proper suffix:

From the above we see that the mismatch happened at index 9, after the pattern matched the substring ababa. Now, if we want to take advantage of the comparisons we have done so far, we can increment i by more than 1; the number of comparisons is then reduced, leading to a better time complexity.
What advantage can we take of the already-processed comparison data "ababa"? If we look carefully: the prefix aba of the string ababa was compared with the text and matched, and the same holds for the suffix aba. But there is a mismatch of 'b' with 'a':

bacbababaabcbab
    |||||x          <- prefix aba and suffix aba of ababa both matched the text
      |||
      abababca      <- aba at text indices 6..8 is already known to match

But according to the naïve approach we would increment i to 5. By looking at the alignment, however, we know we can set i = 6, since the next occurrence of aba starts at index 6. So instead of comparing against each and every position in the text, we preprocess the pattern to find, for every prefix, the longest proper prefix that is also a proper suffix (this is called a border). In the above example, the longest border of 'ababa' is aba, of length 3. So we increment by: length of matched substring - length of its longest border = 5 - 3 = 2.
If our comparison had instead failed after matching only aba (j = 3), the longest border ('a') has length 1, so we would increment by 3 - 1 = 2.

For more on how to preprocess : http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080 http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm

Imposter

I am not sure that your difficulty is limited to this one point, so, if you don't mind, I'll just describe (with as much explanation as possible) the whole algorithm. The answer to your question is probably in the last paragraph, but you'd better read it all to understand my terminology better.

During the KMP algorithm you are, actually, computing nearly the same values as in the table (this is usually called the prefix function). So when you get to position i in the text, you need to compute the maximum length of a substring of the text ending at position i that equals some prefix of the pattern. It is quite clear that you have found the pattern in the text if and only if this length equals the length of the pattern. So, how do you compute this prefix-function value fast? (I suppose you could compute these values with some O(n^2) algorithm, which is not fast enough.) Let's suppose that we have already done everything for the first i-1 symbols of the text and are now working with position i. We will also need the prefix-function value for the previous symbol of the text: p[i-1].

Let's compare text[i] and pattern[p[i-1]] (indexing from 0, if you don't mind). We already know that pattern[0 : p[i-1]-1] == text[i-p[i-1] : i-1]: that's the definition of p[i-1]. So, if text[i] == pattern[p[i-1]], we now know that pattern[0 : p[i-1]] == text[i-p[i-1] : i], and that's why p[i] = p[i-1] + 1. But the interesting part starts when text[i] != pattern[p[i-1]].

When these symbols are different, we start jumping. The reason for that is that we want to find the next possible prefix as fast as we can. So, here is how we do it. Look at the picture and follow the explanation (the yellow parts are the substrings found for text[i-1]). We are trying to find some string s = s1 + text[i]. Because of the prefix-function definition, s1 = s2 and c = text[i]. But we already know (from computing the value for text[i-1]) that the two yellow parts in the picture are the same, so s3 actually equals s1. So we can find the length of s1: it is table[p[i-1] - 1]. Now, if c1 == text[i], we should stop: we've found p[i], and it is s1.length + 1. And if c1 != text[i], we just repeat the same jumping, now looking at the first table[table[p[i-1] - 1] - 1] symbols of the pattern, and we go on like this until we find the answer or get down to a prefix of 0 symbols, in which case p[i] = 0.

haskile