I would also like to know which algorithm has the best worst-case complexity for finding all occurrences of a string in another. It seems that the Boyer–Moore algorithm has linear time complexity.

templatetypedef
Ouais Alsharif

4 Answers

The KMP algorithm has linear complexity for finding all occurrences of a pattern in a string, like the Boyer-Moore algorithm¹. If you try to find a pattern like "aaaaaa" in a string like "aaaaaaaaa", once you have the first complete match,

aaaaaaaaa
aaaaaa
 aaaaaa
      ^

the border table records that the next longest possible match of a prefix of the pattern (corresponding to the widest border of the pattern) is just one character short; in this respect, a complete match is equivalent to a mismatch one past the end of the pattern. Thus the pattern is moved one place further, and since the border table tells us that all characters of the pattern except possibly the last already match, the next comparison is between the last pattern character and the text character aligned with it. In this particular case (finding the occurrences of a^m in a^n), which is the worst case for the naive matching algorithm, the KMP algorithm compares each text character exactly once.

In each step, at least one of

  • the position of the text character compared
  • the position of the first character of the pattern with respect to the text

increases, and neither ever decreases. The position of the compared text character can increase at most length(text) - 1 times, and the position of the first pattern character at most length(text) - length(pattern) times, so the algorithm takes at most 2*length(text) - length(pattern) - 1 steps.

The preprocessing (construction of the border table) takes at most 2*length(pattern) steps. Thus the overall complexity is O(m+n), and no more than m + 2*n steps are executed, where m is the length of the pattern and n the length of the text.
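To make the counting argument concrete, here is a minimal Java sketch that instruments a KMP search with a comparison counter. The class name `KmpComparisonCount`, the static `comparisons` field, and the choice to return no matches for an empty pattern are assumptions made for this illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

final class KmpComparisonCount {
    /** Character comparisons made by the most recent call to search(). */
    static long comparisons;

    /** border[i] = length of the widest proper border of pattern[0..i]. */
    static int[] borderTable(String pattern) {
        int m = pattern.length();
        int[] border = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = border[k - 1];          // fall back to the next shorter border
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            border[i] = k;
        }
        return border;
    }

    /** Returns the start index of every (possibly overlapping) occurrence. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        comparisons = 0;
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return hits;   // assumption: empty pattern yields no matches
        int[] border = borderTable(pattern);
        int i = 0, j = 0;                   // i indexes the text, j the pattern
        while (i < n) {
            comparisons++;
            if (text.charAt(i) == pattern.charAt(j)) {
                i++;
                j++;
                if (j == m) {               // complete match: treat it like a mismatch
                    hits.add(i - m);        // one past the end of the pattern
                    j = border[m - 1];
                }
            } else if (j > 0) {
                j = border[j - 1];          // shift the pattern, keep the text position
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Pattern a^6 in text a^9: prints [0, 1, 2, 3] comparisons=9
        System.out.println(search("aaaaaaaaa", "aaaaaa") + " comparisons=" + comparisons);
    }
}
```

For the a^6 in a^9 example the counter ends at 9, one comparison per text character. For any input it should stay at or below 2*n, since every comparison either advances the text position or shifts the pattern.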

¹ Note that the Boyer-Moore algorithm as commonly presented has a worst-case complexity of O(m*n) for periodic patterns and texts like a^m and a^n if all matches are required, because after a complete match,

aaaaaaaaa
aaaaaa
 aaaaaa
      ^
  <- <-
 ^

the entire pattern would be re-compared. To avoid that, you need to remember how long a prefix of the pattern still matches after the shift following a complete match and only compare the new characters.
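The effect of remembering the matching prefix can be sketched in Java as below. This is a deliberately simplified right-to-left scanner, not a full Boyer-Moore implementation: on a mismatch it just shifts by one instead of using the bad-character and good-suffix tables. The class name `GalilSketch`, the `rememberPrefix` flag, and the `comparisons` counter are all hypothetical names introduced for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class GalilSketch {
    /** Character comparisons made by the most recent call to search(). */
    static long comparisons;

    /** Smallest period of the pattern: its length minus the length of its widest border. */
    static int period(String pattern) {
        int m = pattern.length();
        int[] border = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = border[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            border[i] = k;
        }
        return m - border[m - 1];
    }

    /**
     * Compares right to left at each alignment, as Boyer-Moore does.
     * If rememberPrefix is true, the prefix known to match after the shift
     * that follows a complete match is not compared again.
     */
    static List<Integer> search(String text, String pattern, boolean rememberPrefix) {
        List<Integer> hits = new ArrayList<>();
        comparisons = 0;
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return hits;   // assumption: empty pattern yields no matches
        int p = period(pattern);
        int known = 0;                      // pattern[0..known-1] is already known to match
        int s = 0;                          // current alignment of the pattern in the text
        while (s + m <= n) {
            int j = m - 1;
            while (j >= known) {
                comparisons++;
                if (pattern.charAt(j) != text.charAt(s + j)) break;
                j--;
            }
            if (j < known) {                // complete match at alignment s
                hits.add(s);
                s += p;                     // no occurrence can start closer than one period
                known = rememberPrefix ? m - p : 0;
            } else {
                s += 1;                     // simplification: real Boyer-Moore shifts further
                known = 0;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        search("aaaaaaaaa", "aaaaaa", false);
        System.out.println("re-comparing everything: " + comparisons);   // 24
        search("aaaaaaaaa", "aaaaaa", true);
        System.out.println("remembering the prefix:  " + comparisons);   // 9
    }
}
```

For a^6 in a^9 both variants find the same four matches, but the first re-compares the whole pattern at every alignment (24 comparisons) while the second only looks at the newly uncovered character (9 comparisons).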

Daniel Fischer

If you think about it, the worst case for matching the pattern is the one in which you have to visit each index of the LPS array when a mismatch occurs. For example, the pattern "aaaa", whose LPS array is [0,1,2,3], makes this possible.

Now, for the worst case in the text, we want to maximize the number of mismatches that force us to visit all the indices of the LPS array. That would be a text consisting of the repeated pattern, but with the last character of each repetition being a mismatch. For example, "aaabaaacaaabaaacaaabaaac".

Let the length of the text be n and that of the pattern be m. The number of such near-occurrences of the pattern in the text is n/m, and for each of them we perform m comparisons. We also traverse the n characters of the text.

Therefore, the worst-case time for KMP matching would be O(n + (n/m)*m), which is O(n).

Total worst case time complexity, including LPS creation, would be O(n+m).

KMP Code (for reference):

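import java.util.ArrayList;
import java.util.List;

// Fills lps so that lps[i] is the length of the longest proper prefix of
// pattern[0..i] that is also a suffix of it. lps must have length pattern.length.
void createLPS(char[] pattern, int[] lps) {
    int m = pattern.length;
    if (m == 0) {
        return;                 // nothing to fill for an empty pattern
    }
    int i = 1;
    int j = 0;
    lps[0] = 0;
    while (i < m) {
        if (pattern[j] == pattern[i]) {
            lps[i] = j + 1;
            i++;
            j++;
        } else if (j != 0) {
            j = lps[j - 1];     // fall back to the next shorter candidate
        } else {
            lps[i] = 0;
            i++;
        }
    }
}

// Returns the start index of every (possibly overlapping) occurrence.
List<Integer> match(char[] str, char[] pattern, int[] lps) {
    int m = pattern.length;
    int n = str.length;
    int i = 0, j = 0;
    List<Integer> idxs = new ArrayList<>();
    if (m == 0) {
        return idxs;            // avoid indexing into an empty pattern
    }
    while (i < n) {
        if (pattern[j] == str[i]) {
            j++;
            i++;
        } else if (j != 0) {
            j = lps[j - 1];
        } else {
            i++;
        }
        if (j == m) {
            idxs.add(i - m);
            j = lps[j - 1];     // keep going to find overlapping matches
        }
    }
    return idxs;
}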
Aryan

There is a long article on KMP at http://en.wikipedia.org/wiki/Knuth-morris-pratt which ends by saying:

Since the two portions of the algorithm have, respectively, complexities of O(k) and O(n), the complexity of the overall algorithm is O(n + k).

These complexities are the same, no matter how many repetitive patterns are in W or S. (end quote)

So the total cost of a KMP search is linear in the number of characters of string and pattern. I think this holds even if you need to find multiple occurrences of the pattern in the string - and if not, just consider searching for patternQ, where Q is a character that does not occur in the text, and noting down where the KMP state shows that it has matched everything up to the Q.

mcdowella
  • This is not very clear. Say I want to use KMP to find the occurrences of "aaa" in "aaaaa": wouldn't KMP need to do n*m comparisons to find all the occurrences? – Ouais Alsharif Feb 07 '12 at 19:58
  • It would do O(3+5), which means (3+5) times some constant – kilotaras Feb 07 '12 at 20:21
  • KMP avoids comparisons by remembering how many characters have been matched so far. After seeing and matching aaa it knows that the last 3 characters of the string to be searched are aaa, so when it sees that this is followed by another a it knows that this, too, is a match with the last three characters including the new match aaa. This is not in the Wikipedia code, which returns at the match. If you use the aaaQ trick, KMP will know that it has aaa, and should go to the state representing aa, find that the character != Q is a, and then go to the state aaa again. – mcdowella Feb 08 '12 at 04:38

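You can compute the Pi function (prefix function) of a string in O(length). KMP can be viewed as building a special string of length n+m+1 (the pattern, a separator character that occurs in neither string, then the text) and computing the Pi function on it, so in any case the complexity will be O(n+m+1) = O(n+m).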
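As a rough Java sketch of this idea (the class name `PrefixFunctionSearch` and the caller-supplied `separator` parameter are assumptions for illustration, and the separator is assumed to occur in neither the pattern nor the text):

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixFunctionSearch {
    /** pi[i] = length of the longest proper prefix of s[0..i] that is also its suffix. */
    static int[] prefixFunction(String s) {
        int[] pi = new int[s.length()];
        for (int i = 1, k = 0; i < s.length(); i++) {
            while (k > 0 && s.charAt(i) != s.charAt(k)) k = pi[k - 1];
            if (s.charAt(i) == s.charAt(k)) k++;
            pi[i] = k;
        }
        return pi;
    }

    /**
     * Finds all occurrences by computing the prefix function of
     * pattern + separator + text. Assumes the separator occurs in
     * neither string, so no value of pi can exceed pattern.length().
     */
    static List<Integer> search(String text, String pattern, char separator) {
        List<Integer> hits = new ArrayList<>();
        int m = pattern.length();
        if (m == 0) return hits;            // assumption: empty pattern yields no matches
        String joined = pattern + separator + text;
        int[] pi = prefixFunction(joined);
        for (int i = m + 1; i < joined.length(); i++) {
            if (pi[i] == m) {
                hits.add(i - 2 * m);        // start of the match, as an index into text
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("aaaaa", "aaa", '#'));   // [0, 1, 2]
    }
}
```

The joined string has length n+m+1 and the prefix function is computed in a single linear pass, which is where the O(n+m) bound comes from. A practical implementation would usually avoid materializing the joined string, but the complexity is the same.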

kilotaras