5

I know that there are fast string search algorithms, like Boyer–Moore and Knuth–Morris–Pratt, that have O(n+m) complexity, while the trivial solution would be O(n*m).

So does the implementation of strstr() for the most popular toolchains - gcc and Visual Studio - use these fast O(n) algorithms, or does it use the trivial solution?

sashoalm
  • 1
    What is the algorithm for string search that has O(n)? – this Apr 02 '14 at 08:09
  • @self. KMP is O(m+n), where m is the pattern length – Jim Yang Apr 02 '14 at 08:12
  • Better algorithmic complexity doesn't always mean a faster execution in normal situations: http://blogs.msdn.com/b/oldnewthing/archive/2006/01/19/514834.aspx – Michael Burr Apr 02 '14 at 08:39
  • @MichaelBurr I'm assuming they can do something like the sort optimization, where they choose the algorithm according to the array size - bubble sort for smaller arrays, and qsort for bigger. – sashoalm Apr 02 '14 at 08:48

1 Answer

5

glibc, the C library normally used with GCC, implements strstr() with the Two-Way algorithm, which performs 2n-m text character comparisons in the worst case. The search phase is O(n), but it needs an extra preprocessing phase that is O(m). Details about the algorithm are at http://www-igm.univ-mlv.fr/~lecroq/string/node26.html.

AFAIK the MSVC runtime implements strstr in the most naive way, with O(n*m) complexity. But brute force needs no extra memory, so it can never fail with an allocation error. KMP needs O(m) extra space, and Two-Way needs only constant extra space.
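To make the "no extra memory" point concrete, here is a minimal sketch of a brute-force search in C. The name `naive_strstr` is hypothetical, and this is an illustration of the technique only, not the actual MSVC source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical brute-force substring search: O(n*m) time in the worst
 * case, but it uses only a few pointers and never allocates.
 * A sketch for illustration, not any vendor's real implementation. */
const char *naive_strstr(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    /* An empty needle matches at the start, as with the standard strstr(). */
    if (*needle == '\0')
        return haystack;

    /* Try every possible starting position in the haystack. */
    for (const char *start = haystack; *start != '\0'; ++start) {
        const char *h = start;
        const char *n = needle;

        /* Advance while the characters agree and the needle has not ended.
         * If the haystack ends first, *h is '\0' and the comparison fails. */
        while (*n != '\0' && *h == *n) {
            ++h;
            ++n;
        }

        if (*n == '\0')
            return start;   /* the whole needle matched here */
    }
    return NULL;            /* no match */
}
```

Calling `naive_strstr("hello world", "world")` returns a pointer to the `w`, and a needle that does not occur gives `NULL`. KMP would need an additional failure table of m entries before it could start searching, which is the O(m) space mentioned above.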

What GCC is doing sounds just like using an FFT to compute multiplications: it looks extremely fast on paper, but is really slow in practice. MSVC uses SIMD instructions in strstr when they are available, so it is even faster in most cases. I would choose the brute-force approach with SIMD if I were writing my own library.
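One portable way to approximate "brute force with SIMD" without writing intrinsics is to let the C library do the scanning. The sketch below uses `memchr` to find candidates for the first needle character and `memcmp` to verify them. It assumes, without guarantee, that the platform's `memchr` and `memcmp` are vectorized. The name `memchr_strstr` is hypothetical, and this is not how MSVC or glibc actually implement strstr.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical brute-force search that delegates the hot loops to
 * memchr/memcmp. Any speedup depends on the C library vectorizing
 * those calls, which is an assumption. Worst case is still O(n*m). */
const char *memchr_strstr(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    size_t n = strlen(haystack);   /* extra pass over the text to get its length */
    size_t m = strlen(needle);

    if (m == 0)
        return haystack;           /* empty needle matches at the start */
    if (m > n)
        return NULL;               /* needle cannot fit */

    const char *p = haystack;
    const char *last = haystack + (n - m);   /* last valid start position */

    while (p <= last) {
        /* Jump to the next occurrence of the needle's first character. */
        p = memchr(p, needle[0], (size_t)(last - p) + 1);
        if (p == NULL)
            return NULL;
        /* Verify the full needle at this candidate position. */
        if (memcmp(p, needle, m) == 0)
            return p;
        ++p;
    }
    return NULL;
}
```

This version pays for a `strlen` over the whole haystack up front, so it only makes sense when the length is needed anyway or the text is searched more than once.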

Aean
  • 1
    Can you prove it? It seems MSVC is a little bit faster than glibc on my computer. – Jim Yang Apr 02 '14 at 08:30
  • 1
    @Jim Yang: Time complexity is not the same as execution time. MSVC uses SIMD instructions in `strstr` when they are available, while GCC does not. So MSVC performs better on short strings. It depends on the length. – Aean Apr 02 '14 at 08:33
  • I'm not very familiar with SIMD, but I compiled the code in VS2010, using both MSVC's version and code extracted from glibc. The glibc code was copied from someone's blog, and I didn't see the Two-Way algorithm in it (though I did see it in glibc 2.19). So maybe I was wrong? Both find a two-character word at the end of a 500 MB string in almost 0.2ms – Jim Yang Apr 02 '14 at 08:39
  • The source code of gcc can be browsed at https://github.com/mirrors/gcc. I tried finding strstr()'s implementation, but couldn't find it. Or is it actually a different project? I assume it's in glibc, but I thought glibc was part of the gcc project. – sashoalm Apr 02 '14 at 10:03