GNU egrep (grep -E) uses a DFA engine if the pattern contains no backreferences*; grep -P uses PCRE's NFA implementation. DFA engines never backtrack, while the pattern ^A|bA can trigger lots of inefficient backtracking with PCRE.
PCRE checks for ^A, then bA, at every single position in the string until it finds a match. For large input that doesn't match until late in the string (or at all), this can take a long time.
You can see this with the pcretest utility:
$ pcretest
PCRE version 8.32 2012-11-30
re> /^A|bA/C
data> bcAbcAbcA
--->bcAbcAbcA
+0 ^ ^
+1 ^ A
+3 ^ b
+4 ^^ A
+0 ^ ^
+3 ^ b
+0 ^ ^
+3 ^ b
+0 ^ ^
+3 ^ b
+4 ^^ A
+0 ^ ^
+3 ^ b
+0 ^ ^
+3 ^ b
+0 ^ ^
+3 ^ b
+4 ^^ A
+0 ^ ^
+3 ^ b
+0 ^ ^
+3 ^ b
No match
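The retry loop visible in that trace can be imitated with a toy model. This is only a sketch in Python, not PCRE's actual algorithm: the function name count_start_positions is made up for illustration, and it ignores any optimizations a real engine applies. It counts how many start positions a naive matcher tries for ^A|bA before it either matches or runs out of input.

```python
def count_start_positions(subject: str) -> int:
    """Toy model (not PCRE): count the start positions tried for ^A|bA."""
    tried = 0
    for start in range(len(subject)):
        tried += 1
        # Alternative 1: ^A can only succeed at the very start of the subject.
        if start == 0 and subject.startswith("A"):
            break
        # Alternative 2: the literal "bA" beginning at this position.
        if subject.startswith("bA", start):
            break
    return tried


print(count_start_positions("bcAbcAbcA"))  # 9: every position is tried, no match
print(count_start_positions("xxbA"))       # 3: stops once "bA" is found at index 2
```

In this model the work grows with the length of the subject whenever the match is late or absent, which is the behavior described above.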
(?<![^b])A is faster because instead of testing for a match at every position, PCRE skips directly to the first A; if that doesn't match, it skips to the next A, and so on until the end of the string:
re> /(?<![^b])A/C
data> bcAbcAbcA
--->bcAbcAbcA
+0 ^ (?<![^b])
+4 ^ ^ [^b]
+8 ^ )
+0 ^ (?<![^b])
+4 ^ ^ [^b]
+8 ^ )
+0 ^ (?<![^b])
+4 ^^ [^b]
+8 ^ )
+0 ^ (?<![^b])
+4 ^ [^b]
+8 ^ )
No match
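Swapping one pattern for the other is only safe if both select the same lines. Here is a quick sanity check, written with Python's re module rather than PCRE (an assumption made for convenience; the helper matching_lines and the sample data are hypothetical). It applies each pattern line by line, as grep does, so it says nothing about subjects with embedded newlines.

```python
import re

ALTERNATION = re.compile(r"^A|bA")
LOOKBEHIND = re.compile(r"(?<![^b])A")


def matching_lines(pattern, lines):
    """Return the lines in which the pattern matches somewhere."""
    return [line for line in lines if pattern.search(line)]


# Hypothetical sample lines, each without an embedded newline.
sample = ["Abc", "xbAy", "bcAbcAbcA", "cA", "", "bbA"]

# Both patterns should pick out exactly the same lines.
assert matching_lines(ALTERNATION, sample) == matching_lines(LOOKBEHIND, sample)
print(matching_lines(LOOKBEHIND, sample))  # ['Abc', 'xbAy', 'bbA']
```

The matched text differs (bA versus A alone), so the two patterns are interchangeable for selecting lines but not for extracting or replacing the match.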
For details about the differences between DFA and NFA implementations, see Russ Cox's article "Regular Expression Matching Can Be Simple And Fast".
* According to "DFA Speed with NFA Capabilities: Regex Nirvana?" on page 182 of Jeffrey Friedl's Mastering Regular Expressions.