4

These are equivalent:

grep -E '^A|bA'
grep -P '^A|bA'
grep -P '(?<![^b])A'

But the second one, grep -P '^A|bA', is several times slower. Why?

They all find the same thing: a line with an A at the beginning or after a b. (Equivalently, a line with an A not preceded by anything other than a b.)
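The equivalence claim can be sanity-checked by brute force. The sketch below is an illustration only: it uses Python's `re` module rather than grep or PCRE (for these simple constructs the syntax is the same), and the function name `disagreements` and the three-letter alphabet are assumptions made for the example. It enumerates every short string and reports any on which the alternation and the lookbehind disagree.

```python
import itertools
import re

# The two distinct patterns from the question (the -E and -P forms of
# '^A|bA' are the same regex, so only two patterns need comparing).
ALTERNATION = re.compile(r"^A|bA")
LOOKBEHIND = re.compile(r"(?<![^b])A")


def disagreements(alphabet="Abc", max_len=6):
    """Return every string over `alphabet`, up to `max_len` characters,
    on which the two patterns give different match/no-match answers."""
    bad = []
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            line = "".join(chars)
            if bool(ALTERNATION.search(line)) != bool(LOOKBEHIND.search(line)):
                bad.append(line)
    return bad


print(disagreements())  # prints [] : no counterexample among 1093 strings
```

An empty list is not a proof, but it covers every combination of "A at the start", "A after b", and "A after something else" on single-line input.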

Is the second command disabling some optimization? Does grep check multiple characters in parallel when it thinks that's faster? I can't come up with another explanation, unless ^ or | means something subtly different in Perl.

ThisSuitIsBlackNot
mgiuffrida
  • Could you show some sample input and benchmark data? It runs in roughly the same time for me, approx 0.001 seconds per run. I tried running it 10000x and got 9.3 vs 9.5 seconds, which is about the same, not multiple times slower. – xxfelixxx Aug 01 '16 at 04:34
  • `perl -le'print "bcA" x 10 for 1..10_000' >neg` generates a file that exhibits the behaviour the OP mentions – ikegami Aug 01 '16 at 06:01
  • Short answer: In order to support the extra features of Perl "regular" expressions, a different, less efficient engine must be used. [See this](http://arstechnica.com/civis/viewtopic.php?f=20&t=1195549) – ikegami Aug 01 '16 at 06:07
  • I'm running this on the Chrome source tree but any collection of text files seems to work. The command by @ikegami works too (after upping to `1_000_000`) – mgiuffrida Aug 01 '16 at 11:21

2 Answers

6

GNU egrep (grep -E) uses a DFA engine if the pattern contains no backreferences*; grep -P uses PCRE's NFA implementation. DFA engines never backtrack, while the pattern ^A|bA can trigger lots of inefficient backtracking with PCRE.

PCRE checks for ^A, then bA, at every single position in the string until it finds a match. For large input that doesn't match until late in the string (or doesn't match at all), this can take a long time.

You can see this with the pcretest utility:

$ pcretest
PCRE version 8.32 2012-11-30

  re> /^A|bA/C
data> bcAbcAbcA
--->bcAbcAbcA
 +0 ^             ^
 +1 ^             A
 +3 ^             b
 +4 ^^            A
 +0  ^            ^
 +3  ^            b
 +0   ^           ^
 +3   ^           b
 +0    ^          ^
 +3    ^          b
 +4    ^^         A
 +0     ^         ^
 +3     ^         b
 +0      ^        ^
 +3      ^        b
 +0       ^       ^
 +3       ^       b
 +4       ^^      A
 +0        ^      ^
 +3        ^      b
 +0         ^     ^
 +3         ^     b
No match

(?<![^b])A is faster because instead of testing for a match at every position, PCRE skips directly to the first A; if that doesn't match, it skips to the next A, and so on until the end of the string:

  re> /(?<![^b])A/C
data> bcAbcAbcA
--->bcAbcAbcA
 +0   ^           (?<![^b])
 +4   ^      ^    [^b]
 +8   ^           )
 +0      ^        (?<![^b])
 +4      ^   ^    [^b]
 +8      ^        )
 +0         ^     (?<![^b])
 +4         ^^    [^b]
 +8         ^     )
 +0          ^    (?<![^b])
 +4          ^    [^b]
 +8          ^    )
No match
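The difference between the two traces can be mimicked with a toy model. This is a rough sketch, not PCRE's actual algorithm: the function names and the way a "step" is counted are assumptions made for illustration. The first function restarts the alternation at every offset; the second jumps straight to each A (here via `str.find`) and only then looks at the preceding character.

```python
def scan_every_position(line):
    """Toy model of the first trace: try ^A, then bA, at each start offset."""
    steps = 0
    for start in range(len(line)):
        steps += 1  # attempt the ^A branch
        if start == 0 and line[start] == "A":
            return True, steps
        steps += 1  # attempt the bA branch
        if line.startswith("bA", start):
            return True, steps
    return False, steps


def skip_to_each_a(line):
    """Toy model of the second trace: jump to each A, then check
    the character just before it."""
    steps = 0
    pos = line.find("A")
    while pos != -1:
        steps += 1  # one lookbehind check per candidate A
        if pos == 0 or line[pos - 1] == "b":
            return True, steps
        pos = line.find("A", pos + 1)
    return False, steps


line = "bcA" * 10  # same shape as the non-matching test lines in the comments
print(scan_every_position(line))  # (False, 60)
print(skip_to_each_a(line))       # (False, 10)
```

On a non-matching line the first strategy does work proportional to the line length, while the second does work proportional to the number of A characters, with the search for the next A delegated to a fast primitive.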

For details about the differences between DFA and NFA implementations, see Russ Cox's article "Regular Expression Matching Can Be Simple And Fast".
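To make the DFA side concrete, here is a hand-written automaton for ^A|bA. It is only a sketch: the state names and functions are invented for this example, and GNU grep builds its automaton programmatically and applies further optimizations rather than using a table like this. The point is that each input character is read exactly once and the matcher never goes back.

```python
# States: START = nothing read yet, AFTER_B = previous character was 'b',
# OTHER = previous character was anything else, ACCEPT = match found.
START, AFTER_B, OTHER, ACCEPT = range(4)


def next_state(state, ch):
    """Single transition of the hand-built DFA for ^A|bA."""
    if ch == "A" and state in (START, AFTER_B):
        return ACCEPT
    if ch == "b":
        return AFTER_B
    return OTHER


def dfa_search(line):
    """Return True if `line` contains a match for ^A|bA.
    Every character is examined once; there is no backtracking."""
    state = START
    for ch in line:
        state = next_state(state, ch)
        if state == ACCEPT:
            return True
    return False


for sample in ("Abc", "xbA", "bcAbcAbcA"):
    print(sample, dfa_search(sample))
```

START and AFTER_B behave identically on every input character, which mirrors the lookbehind formulation: an A matches when nothing other than a b comes before it.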


* According to "DFA Speed with NFA Capabilities: Regex Nirvana?" on page 182 of Jeffrey Friedl's Mastering Regular Expressions.

ThisSuitIsBlackNot
-1

grep may perform worse with the -P option partly because it uses a different regex engine (PCRE), whose algorithms are more complex. A quick look under the hood shows what takes place (GNU grep 2.20):

-E, --extended-regexp

Starting program: /usr/bin/grep -Eq \^A\|bA wordlist.dic
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, 0x0000000000402ec0 in main ()
(gdb) si 200000
0x000000000040c667 in dfacomp ()
(gdb) bt
#0  0x000000000040c667 in dfacomp ()
#1  0x000000000040d618 in GEAcompile ()
#2  0x000000000040328a in main ()
(gdb) si 200000
0x00007ffff78410db in memchr () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff78410db in memchr () from /lib64/libc.so.6
#1  0x000000000040f952 in kwsexec ()
#2  0x000000000040dc2d in EGexecute ()
#3  0x000000000040500a in grepbuf ()
#4  0x0000000000405b60 in grepdesc ()
#5  0x000000000040330f in main ()
(gdb) si 200000
[Inferior 1 (process 23706) exited normally]

-P, --perl-regexp

Starting program: /usr/bin/grep -Pq \^A\|bA wordlist.dic
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, 0x0000000000402ec0 in main ()
(gdb) si 200000
0x00007ffff7835eed in _int_malloc () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7835eed in _int_malloc () from /lib64/libc.so.6
#1  0x00007ffff783826c in malloc () from /lib64/libc.so.6
#2  0x00007ffff7ba0fe4 in sljit_create_compiler () from /lib64/libpcre.so.1
#3  0x00007ffff7bbb32f in _pcre_jit_compile () from /lib64/libpcre.so.1
#4  0x00007ffff7bbfd8d in pcre_study () from /lib64/libpcre.so.1
#5  0x000000000041000e in Pcompile ()
#6  0x000000000040328a in main ()
(gdb) si 1200000
0x00007ffff7bbdc31 in _pcre_jit_exec () from /lib64/libpcre.so.1
(gdb) bt
#0  0x00007ffff7bbdc31 in _pcre_jit_exec () from /lib64/libpcre.so.1
#1  0x00007ffff7b9f083 in pcre_exec () from /lib64/libpcre.so.1
#2  0x0000000000410372 in Pexecute ()
#3  0x000000000040500a in grepbuf ()
#4  0x0000000000405b60 in grepdesc ()
#5  0x000000000040330f in main ()
... and still going ...

As you can see, much more happens when grep calls the PCRE engine than when it uses its built-in matcher. With -P, grep has to execute more than four times as many instructions to search for the same pattern.

l'L'l
  • The engine being "external" versus "native" (what does that even mean?) is irrelevant. The depth of the stack trace is also irrelevant. What matters is the type of engine used (DFA vs NFA), check the other answer. A DFA engine will do its job in O(n), with n = size of input. A NFA engine's execution time depends on the pattern's complexity. The price you have to pay for a DFA execution is a reduced feature set. – Lucas Trzesniewski Aug 01 '16 at 18:19
  • There is no additional process here. `pcre_exec` is just the entry point of the PCRE engine, nothing to do with `exec`. And `_pcre_jit_exec` is an internal function which executes native machine code that PCRE compiled in-memory from your pattern. – Lucas Trzesniewski Aug 01 '16 at 18:35
  • @LucasTrzesniewski: If it's an internal function of grep then why would `/lib64/libpcre.so.1` show up at all in the backtrace? – l'L'l Aug 01 '16 at 18:38
  • It's an internal function of PCRE, not grep. And it doesn't matter *where* this function comes from. What ultimately matters is *what* the algorithm does. – Lucas Trzesniewski Aug 01 '16 at 18:45
  • @ikegami: I think that's where most of the confusion about my answer comes from. Also, do you have a source for what you mentioned about the library only loading once in the process? (I was just curious.) And what do you mean by "deeper nesting"? Thanks. – l'L'l Aug 02 '16 at 04:06
  • @ikegami: Stepping through the program with the different arguments makes it obvious that there are more instructions; I don't understand what "deeper nesting" is either... – l'L'l Aug 02 '16 at 04:13
  • I misunderstood what your snippets were demonstrating. – ikegami Aug 02 '16 at 04:47
  • Showing that more instructions are being executed just supports that it's a CPU-bound function, but no one should be surprised by that. I don't think anyone thought `-P` is slower because of disk usage, thread locking, etc., so it was already obvious that more instructions are being executed. The question is why. – ikegami Aug 02 '16 at 04:52
  • Where is it obvious that more instructions are being executed unless you debug the program by some means? – l'L'l Aug 02 '16 at 05:05