To understand the logic behind the KMP algorithm , you should first understand , how this KMP algo is better than brute-force algorithm .
Idea
After a shift of the pattern, the naive algorithm has forgotten all information about previously matched symbols. So it is possible that it re-compares a text symbol with different pattern symbols again and again. This leads to its worst case complexity of Θ(nm) (n: length of the text, m: length of the pattern).
The algorithm of Knuth, Morris and Pratt [KMP 77] makes use of the information gained by previous symbol comparisons. It never re-compares a text symbol that has matched a pattern symbol. As a result, the complexity of the searching phase of the Knuth-Morris-Pratt algorithm is in O(n).
However, a preprocessing of the pattern is necessary in order to analyze its structure. The preprocessing phase has a complexity of O(m). Since m<=n, the overall complexity of the Knuth-Morris-Pratt algorithm is in O(n).
text :bacbababaabcbab
pattern:abababca
In brute-force method ,
Slide the pattern over text one by one and check for a match. If a match is found, then slides by 1 again to check for subsequent matches .
void search(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
/* A loop to slide pat[] one by one */
for (int i = 0; i <= N - M; i++)
{
int j;
/* For current index i, check for pattern match */
for (j = 0; j < M; j++)
{
if (txt[i+j] != pat[j])
break;
}
if (j == M) // if pat[0...M-1] = txt[i, i+1, ...i+M-1]
{
printf("Pattern found at index %d \n", i);
}
}
}
The complexity of above algorithm is O(nm).
In the above algorithm we never used comparison data we processed,
Bacbababaabcbab //let I be iterating variable over this text
Abababca //j be iterating variable over pattern
When i=0 and j=0 there is a mismatch (text[i+j]!=pat[j]), we increment i until there is a match .
When i =4 , there is a match(text[i+j]==pat[j]), increment j till we find mismatch (if j= patternlength we found pattern) ,in the given example we find mismatch at j=5 when i=4 , a mismatch happens at idex 4+5=9 in text. The sub string that matched is ababa ,
**
Why we need to choose longest proper prefix which is also proper suffix :
**
From the above : we see that mismatch happened at 9 where pattern ended with substring ababa .
Now if we want to take advantage over the comparisons we have done so far then we can skip (increment) i more than 1 then the numbers of comparisons will be reduced leading to better time complexity.
Now what advantage we can take on processed comparison data “ababa” .
If we see carefully: the prefix aba of string ababa is compared with text and matched, same is the case with suffix aba. But there is a mismatch ‘b’ with ‘a’
Bacbababaabcbab
||||||
||||| x
|||| ||
ababab
But according to naïve approach, we increment i to 5. But we know by looking at it, we can set i =6 as next occurrence of aba occurs at 6. So instead of comparing with each and every element in text we preprocess the pattern for finding the longest proper prefix which is also proper suffix (which is called border). In the above example for ‘ababa’ ,and length of longest border is 3 (which is aba) . So increment by: length of substring – length of longest border => 5-3 =2.
If our comparison fails at aba , then length of longest border is 1 and j=3 so increment by 2 .
For more on how to preprocess : http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080
http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm