
The Boyer-Moore algorithm has a preprocessing time of Θ(m + |Σ|) and a matching time of Ω(n/m) and O(n). I understand that Boyer-Moore-Horspool is itself a development of the simplified Boyer-Moore algorithm, but its average-case complexity is O(n) and its worst case is O(mn), according to this Wikipedia article. So in the worst case it should be slower than the full Boyer-Moore algorithm. Yet this classic survey from the University of Chile shows Boyer-Moore-Horspool outperforming Boyer-Moore almost every time. I am confused! Which one should I use for string searching (for small as well as large patterns), and which algorithm has greater significance in the practical world? (I am just a computer science student.)

  • It's a simple time vs. space tradeoff. One or the other may be better _for your requirements_, but neither is "better" in general. Why not experiment with a [well documented, well tested implementation](http://www.boost.org/libs/algorithm/) of each and profile to see what works best for _your_ data? – ildjarn Jul 12 '12 at 23:27

1 Answer


The key word is "almost". The worst-case behavior may occur on only a vanishingly small fraction of inputs, and average behavior in real life is only loosely coupled to asymptotic behavior. The best-case behavior of Boyer-Moore-Horspool is the same as that of Boyer-Moore. The worst case of Boyer-Moore-Horspool is quite a bit worse than that of Boyer-Moore. For typical use, Boyer-Moore-Horspool tends to perform about the same as Boyer-Moore, but with slightly lower overhead and initialization costs.
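
For concreteness, here is a minimal sketch of Boyer-Moore-Horspool in C++ (the name `horspool_search` and the assumption of single-byte characters are mine, not from any particular library). It shows where the lower initialization cost comes from: BMH builds only the single bad-character shift table, whereas full Boyer-Moore additionally preprocesses a good-suffix table.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Minimal Boyer-Moore-Horspool sketch: one 256-entry shift table,
// no good-suffix rule. Returns the index of the first match,
// or std::string::npos if the pattern does not occur.
std::size_t horspool_search(const std::string& text, const std::string& pattern) {
    const std::size_t n = text.size(), m = pattern.size();
    if (m == 0 || m > n) return std::string::npos;

    // Bad-character table: the default shift is the full pattern length.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    // For each pattern character except the last, shift so that its
    // rightmost occurrence lines up with the text character we read.
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[static_cast<unsigned char>(pattern[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        // Compare the pattern against the window from right to left.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pattern[j - 1]) --j;
        if (j == 0) return pos;
        // Shift by the table entry for the text character aligned
        // with the last pattern position, regardless of where the
        // mismatch occurred.
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return std::string::npos;
}
```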

Which one to use? It depends on your goals and what you expect in the way of patterns and text to be searched. Neither is particularly hard to implement, so why not do both and compare the results yourself? (See what happens when you admit that you're a student? You get an assignment! :))
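
If you want a quick comparison without writing both from scratch, note that C++17 ships both algorithms as standard searchers (`std::boyer_moore_searcher` and `std::boyer_moore_horspool_searcher`), so a rough harness along these lines (sizes picked arbitrarily) will do; try ordinary English text as well as degenerate inputs such as the `ba^m`-in-`a^n` case mentioned in the comments below:

```cpp
// Requires C++17.
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <string>

// Time a single search of `text` with the given searcher, in microseconds.
template <typename Searcher>
long long time_search(const std::string& text, const Searcher& s) {
    auto t0 = std::chrono::steady_clock::now();
    auto it = std::search(text.begin(), text.end(), s);
    auto t1 = std::chrono::steady_clock::now();
    volatile bool found = (it != text.end());  // keep the search from being optimized away
    (void)found;
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    // Degenerate input where BM's good-suffix rule shines: the pattern
    // "ba...a" never matches a text of all 'a's, and BMH is stuck making
    // shifts of 1 after scanning most of the pattern each time.
    std::string text(1'000'000, 'a');
    std::string pattern = "b" + std::string(100, 'a');

    std::boyer_moore_searcher bm(pattern.begin(), pattern.end());
    std::boyer_moore_horspool_searcher bmh(pattern.begin(), pattern.end());

    std::cout << "BM:  " << time_search(text, bm)  << " us\n";
    std::cout << "BMH: " << time_search(text, bmh) << " us\n";
}
```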

  • "Average behavior in real life and asymptotic behavior are also rather loosely coupled." That's a poor generalization. It'll murder you with quicksort with mostly-sorted data, for instance. – std''OrgnlDave Jul 12 '12 at 23:37
  • @std''OrgnlDave - All I meant was that asymptotic behavior can be a very poor predictor of "real life" behavior. Tell me to choose one algorithm over another because it has better asymptotic behavior and I'll first ask, "What's the coefficient?" Then I'll ask, "What's the average behavior for the problem domain?" For linear programming, interior-point algorithms have polynomial complexity, but in practice everyone still uses the (asymptotically exponential) simplex algorithm. Why? Because of the coefficients; the break-even point is for problems so big that nobody could run them. – Ted Hopp Jul 12 '12 at 23:46
  • @TedHopp - Thanks :) I did try implementing both and saw almost no performance difference (maybe I didn't pick bad enough cases). But my actual doubt was that _theoretically_ **BM** should be faster than **BMH**, yet the simulation graphs [link](http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap10.htm) show **BMH** dominating instead. So I was confused as to how that happened. – Ritesh Mahato Jul 13 '12 at 20:54
  • 1
    @RiteshMahato Theoretically, BM is faster if a lot of suffix shifts are used. For English text and random input, that is rare. But if you search e.g. `ba^m` in `a^n`, BM blows BMH out of the water. – Daniel Fischer Jul 14 '12 at 18:42
  • Nit: interior point methods are used in practice for large, sparse LPs. – David Eisenstat Jan 29 '17 at 23:33
  • @DavidEisenstat - Good point. I shouldn't have been so categorical in my earlier comment. – Ted Hopp Jan 30 '17 at 00:10