
I am currently learning about pattern matching algorithms and have come across two of them: Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM). I have the following general ideas:

KMP

  • Compares text left-to-right
  • Uses a failure array to shift intelligently
  • Takes O(m) time, where m is the length of the pattern, to compute the failure array
  • Takes O(m) space
  • Takes O(n) time, where n is the length of the text, to search a string
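To make the KMP bullets above concrete, here is a minimal sketch in Python. The function names `build_failure` and `kmp_search` are my own for illustration and are not from any library. The failure array depends only on the pattern, so it can be passed in and reused.

```python
def build_failure(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it. Runs in O(m)."""
    failure = [0] * len(pattern)
    k = 0  # length of the current matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern, failure=None):
    """Return the start index of every occurrence of pattern in text.
    The text is scanned left to right and never re-read, so the
    search is O(n) once the failure array exists."""
    if not pattern:
        return []
    if failure is None:
        failure = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches
```

For example, `build_failure("abab")` gives `[0, 0, 1, 2]`, and `kmp_search("abababa", "aba")` gives `[0, 2, 4]`.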

BM

  • Compares pattern from last character
  • Uses bad character jumps and good suffix jumps
  • Takes O(m + size of alphabet) time to compute the tables
  • Takes O(m + size of alphabet) space
  • Takes O(n) time to search in the worst case, but is usually faster than that in practice
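As a rough illustration of the bad-character idea, here is a sketch of the Horspool simplification of Boyer-Moore in Python. Note the swap: it uses only a bad-character table and omits the good-suffix rule, so unlike full Boyer-Moore its worst case is O(nm). The names `build_bad_char_table` and `horspool_search` are hypothetical, and I use a dict so the table only stores characters that occur in the pattern rather than one entry per alphabet symbol.

```python
def build_bad_char_table(pattern):
    """Map each character of pattern[:-1] to the distance from its
    rightmost occurrence to the last pattern position."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern, shift=None):
    """Return the start index of every occurrence of pattern in text.
    Each alignment is compared starting from the last pattern character."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    if shift is None:
        shift = build_bad_char_table(pattern)
    matches = []
    pos = 0  # current alignment of the pattern against the text
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        # Shift by the table entry for the text character under the
        # pattern's last position; characters not in the table allow
        # a jump of the full pattern length.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

For example, `horspool_search("hello world", "world")` gives `[6]` after looking at only three alignments.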

I came across the following True or False question, which prompted mine:

The Knuth-Morris-Pratt (KMP) algorithm is a good choice if we want to search for the same pattern repeatedly in many different texts.

I believe the answer is true, on the assumption that every time you run the algorithm on a different text the preprocessing costs only O(m) for KMP, whereas for BM it costs O(m + size of alphabet). However, I am not sure whether it is correct to assume that the tables are recomputed every time the algorithm is rerun. Say the text is always in the English alphabet: I would only need to compute the tables once and could then reuse them. So does the answer to this question depend on whether all the texts are drawn from the same alphabet, or is there some other factor that affects it?

Eric
  • Lots of information here: http://stackoverflow.com/q/12656160/56778, and in other SO posts. Do a Google search for [kmp vs boyer-moore]. – Jim Mischel Apr 18 '13 at 14:09
  • @JimMischel I saw that post already, but it does not directly answer the main part of my question. And I already tried to Google it – Eric Apr 18 '13 at 15:32
  • This is exactly what I'm looking for. Any help would be appreciated. – J-Y Apr 18 '13 at 15:37

2 Answers


In theory, both algorithms will have "similar" performance; in the worst case, KMP will do about 2n comparisons in the searching phase and Boyer-Moore will do about 3n. In neither case do you need to repeat the preprocessing when you get a new text, because the tables depend only on the pattern (and, for Boyer-Moore, the alphabet), not on the text.
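To illustrate the point about preprocessing, here is a minimal sketch in Python of a matcher object that builds the KMP failure array once and then searches any number of texts with it. The class name `KMPMatcher` and its methods are hypothetical, made up for this example.

```python
class KMPMatcher:
    """Preprocess a pattern once, then search many texts with it."""

    def __init__(self, pattern):
        if not pattern:
            raise ValueError("pattern must be non-empty")
        self.pattern = pattern
        # O(m) work, done once; it depends only on the pattern.
        self.failure = self._build_failure(pattern)

    @staticmethod
    def _build_failure(pattern):
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k
        return failure

    def find_all(self, text):
        """O(n) per text; no preprocessing is repeated here."""
        pattern, failure = self.pattern, self.failure
        matches = []
        k = 0
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = failure[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                matches.append(i - k + 1)
                k = failure[k - 1]
        return matches


matcher = KMPMatcher("abc")
results = [matcher.find_all(t) for t in ["xabc", "abcabc", "none"]]
```

Here `results` is `[[1], [0, 3], []]`, and the failure array was built exactly once. The same structure would work for Boyer-Moore's tables.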

But the real answer is that you shouldn't use either one in practice.

The linear auxiliary storage needed by both algorithms leads to considerably...rougher performance on modern architectures because of all of the extra memory accesses.

However, the ideas behind Boyer-Moore and KMP underpin most fast string matching algorithms. Something like KMP's "failure function" idea is used by every practically effective string matching algorithm I know of; it turns out that you can compute a suboptimal "failure function" for a pattern on-the-fly that still gives you linear time matching while only needing constant additional space. Boyer-Moore is faster than linear in the "average case" of matching a fixed pattern against random noise, and this is borne out in many practical situations.
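A small sketch of why the "faster than linear" behaviour is possible: the Python function below (hypothetical name `count_inspections`) counts how many text characters a bad-character-only search in the Horspool style looks at. This is a simplification of Boyer-Moore, not the full algorithm, and the input is a hand-picked best case rather than random noise.

```python
def count_inspections(text, pattern):
    """Count the text characters examined by a Horspool-style scan."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0
    # Distance from each character's rightmost occurrence in
    # pattern[:-1] to the last pattern position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    inspected = 0
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0:
            inspected += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        # Characters absent from the pattern allow a jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return inspected


text = "abcdefghij" * 100   # n = 1000
pattern = "xyzxyzxy"        # m = 8, shares no characters with text
inspected = count_inspections(text, pattern)
```

Here `inspected` is 125, about n/m, because every alignment fails on its first comparison and the pattern jumps forward by its full length. A left-to-right scan such as KMP has to look at all 1000 characters.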

tmyklebu
  • It's worth noting that C++'s Boost has both matchers and they work rather well. – user541686 Apr 18 '13 at 17:06
  • @Mehrdad: Constant-space KMP variants beat the pants off straight KMP, though. Whether Boyer-Moore beats that or not generally depends on your input. – tmyklebu Apr 18 '13 at 17:23
  • Interesting answer, but it'd be great if you could say which algorithm you actually should use in practice, if not KMP or BM. – Milo Wielondek Nov 14 '16 at 18:49
  • @0sh: "Two-way string matching," due to Crochemore and Perrin, is effective both in theory and in practice. It is an improvement upon "string matching using maximum suffixes," which is also fairly fast in practice; I am not sure who to attribute that algorithm to. – tmyklebu Nov 16 '16 at 04:52

An old post, I know, but I couldn't let it stand... Never use KMP in any form if you are after speed. It has the virtue of being consistent in its timings no matter the input, but it is always slow compared to other algorithms. It has great historical significance and is useful as a teaching aid, but apart from that... no. Have a look here and try to decide which algorithm matches your use case: https://arxiv.org/pdf/1012.2547v1.pdf. Failing that, BM or BMH (Boyer-Moore-Horspool) will probably be your best bet.

Pete