6

I understand that putting ?: at the start of a parenthesized group in a regular expression prevents it from creating a backreference, which is supposed to be faster. My question is: why do this? Is the speed increase noticeable enough to warrant the consideration? Under what circumstances does it matter so much that you need to carefully skip the backreference every time you are not going to use it? Another disadvantage is that it makes the regex harder to read, edit, and update (if you end up wanting to use a backreference later).

So in summary, why bother not creating a backreference?

Explosion Pills
  • As with anything in programming, speed on a small set is never worth worrying over. If you are running this regex on megabytes of text, then the difference will be large. – Travis Webb Mar 14 '11 at 01:59
  • @Travis, some poorly implemented regular expression engines do exponential backtracking, which can be really slow on even small inputs. I ran into this problem porting some Perl to Python. Python has since fixed a lot of problems with its `re` module, but nevertheless, the failure modes you tend to see with regex corner cases can be O(2**n) worst case. – Mike Samuel Mar 14 '11 at 02:06
  • @Mike I've heard of horrible backtrack implementations, but how do you reach O(2^n)? Where n = ? – Travis Webb Mar 14 '11 at 02:10

2 Answers

13

I think you're confusing backreferences like \1 and capturing groups (...).

Backreferences prevent all kinds of optimizations by making the language non-regular.

Capturing groups make the regular expression engine do a little more work to remember where a group starts and ends, but are not as bad as backreferences.
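
To make the distinction concrete, here is a minimal sketch in Python. The choice of Python's `re` module and the sample strings are assumptions for illustration; the question does not name a regex flavor. It shows a capturing group, the same group made non-capturing with `(?:...)`, and a backreference `\1` that refers back to a capturing group.

```python
import re

# Capturing group: the engine records where group 1 starts and ends.
capturing = re.compile(r"(ab)+c")

# Non-capturing group: same grouping for the quantifier, nothing recorded.
non_capturing = re.compile(r"(?:ab)+c")

# Backreference: \1 must repeat whatever text group 1 actually matched.
backreference = re.compile(r"(ab|cd)\1")

m1 = capturing.fullmatch("ababc")
m2 = non_capturing.fullmatch("ababc")

print(m1.groups())  # ('ab',)  one captured group
print(m2.groups())  # ()       no groups at all

print(bool(backreference.fullmatch("abab")))  # True: second half repeats group 1
print(bool(backreference.fullmatch("abcd")))  # False: "cd" is not what group 1 matched
```

Both of the first two patterns accept exactly the same strings; only the bookkeeping differs. The third pattern is the one that changes what kind of matching the engine has to do.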

http://www.regular-expressions.info/brackets.html explains capturing groups and back references to them in detail.

EDIT:

On backreferences making regular expressions non-regular, consider the following regular expression, which matches Lua comments:

/^--(?:\[(=*)\[[\s\S]*?(?:\]\1\]|$)|[^\r\n]*)/

So --[[...]] is a comment, --[=[...]=] is a comment, --[==[...]==] is a comment. You can nest comments by adding extra equals signs between the square brackets.

This is not a regular language: the closing delimiter must contain exactly as many equals signs as the opening one, so a simple finite state machine cannot handle it in O(n) time -- you need a counter.

Perl 5 regular expressions can handle this using back-references. But as soon as you require non-regular pattern matching, your regular expression library has to give up the simple state-machine approach and use more complex, less-efficient code.
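
As a rough illustration, the Lua-comment pattern above can be tried in Python, whose `re` module also supports `\1` backreferences. The helper name `match_lua_comment` and the sample inputs are hypothetical, chosen only for this sketch.

```python
import re

# Same pattern as above: group 1 captures the run of '=' in the opening
# delimiter, and \1 requires the same run in the closing delimiter.
LUA_COMMENT = re.compile(r"^--(?:\[(=*)\[[\s\S]*?(?:\]\1\]|$)|[^\r\n]*)")


def match_lua_comment(text):
    """Return the comment at the start of text, or None if there is none."""
    m = LUA_COMMENT.match(text)
    return m.group(0) if m else None


print(match_lua_comment("--[[ x ]] y"))               # --[[ x ]]
print(match_lua_comment("--[==[ a ]] b ]==] tail"))   # --[==[ a ]] b ]==]
print(match_lua_comment("-- plain line\nnext line"))  # -- plain line
print(match_lua_comment("x = 1"))                     # None
```

In the second call the inner `]]` does not end the comment, because the closing delimiter has to repeat the two equals signs that group 1 captured.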

Mike Samuel
  • Nice. +1 for providing the correct solution to a question he wasn't even able to ask correctly. – Travis Webb Mar 14 '11 at 02:12
  • This is not a good answer, and I did not confuse the creation of a backreference with a capture group. This answer doesn't answer the question at all. I asked why force the regex to prevent the creation of a backreference (with the capture group). There is no confusion. As for the answer, the second paragraph has shades of a good response, but you provide no explanation or example. What does it mean to make the language non-regular? I don't care about the comparison of capture groups to backreferences, I am talking exclusively about skipping backreference creation – Explosion Pills Mar 14 '11 at 03:07
  • @tandu, I don't know what "force the regex to prevent the creation of a backreference" means. A regex contains capturing groups. Usually only capturing groups 1 through 9 can be referenced as `$1`...`$9` in a substitution string, so that's one reason not to have all parenthetical groups be capturing groups. A back reference is not the same as a capturing group. A back reference is a sequence that appears in a regular expression (as opposed to in a replacement string) that refers back to a capturing group, and Perl 5 uses the `\1`...`\9` syntax for them. – Mike Samuel Mar 14 '11 at 20:49
  • What do you mean “usually only groups 1-9 can be referenced as `$1`⋯`$9` in a substitution”? Perl certainly lets you use as many numbered groups as you please, so you can have say, a `$388` if you are so mercilessly inclined. Do other languages impose arbitrary restrictions on that sort of thing? – tchrist Apr 28 '11 at 01:21
  • @tchrist, Some examples. From http://www.grymoire.com/Unix/Regular.html#uh-10 "You can recall the remembered pattern with "\" followed by a single digit. Therefore, to search for two identical letters, use "\([a-z]\)\1". You can have 9 different remembered patterns." Python allows up to 99: http://docs.python.org/library/re.html . I believe Java is unlimited. JavaScript allows more than 9 but I have not tested the limits. – Mike Samuel Apr 28 '11 at 05:16
7

You're right, performance is not the only reason to avoid capturing groups--in fact, it's not even the most important reason.

> Another disadvantage is that it makes the regex harder to read, edit, and update (if you end up wanting to use a backreference later).

I look at it the other way around: if you habitually use non-capturing groups, it's easier to keep track of the group numbers on those occasions when you do choose to capture something. In the same vein, if you're using named groups (assuming your regex flavor supports them), you should always use named groups, and always refer to them (in backreferences or replacement strings) by name, not by number. Following these rules consistently will at least partially offset the readability penalty of non-capturing groups.
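
A small sketch of that habit, using Python's `re` syntax as an assumption (`(?P<name>...)` for named groups, `(?P=name)` for a named backreference, `\g<name>` in a replacement string). The date and doubled-word patterns are made up for illustration.

```python
import re

# Separators are grouped with (?:...) so they never take up a group number;
# the only groups that exist are the three named ones.
date = re.compile(
    r"(?P<year>\d{4})(?:-|/)(?P<month>\d{2})(?:-|/)(?P<day>\d{2})"
)

m = date.fullmatch("2011/03-14")
print(m.group("year"), m.group("month"), m.group("day"))  # 2011 03 14
print(m.groups())  # ('2011', '03', '14')

# Refer to the group by name both inside the pattern and in the replacement.
doubled = re.compile(r"\b(?P<word>\w+)(?:\s+)(?P=word)\b")
print(doubled.sub(r"\g<word>", "the the cat"))  # the cat
```

Adding or removing a non-capturing group later does not shift any numbers, and the named references keep working unchanged.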

Yes, it is a PITA having to clutter up your regexes that way, and the people who write/maintain the regex implementations know it. In .NET you can set the ExplicitCapture option whereby all "bare" parentheses are treated as non-capturing groups, and only named groups capture. In Perl 6, parentheses (with or without names) always capture, and square brackets are used for non-capturing groups. The other flavors will probably follow suit eventually, but in the meantime we just have to rely on good habits.

Alan Moore
  • The problem with the perl5 syntax is that it is messy to do what you want to do, which is use a lot of `(?:⋯)` for simple unnamed grouping and use `(?<ɴᴀᴍᴇ>⋯)` for named captures and `\k<ɴᴀᴍᴇ>` for named backrefs. Despite being much much better, those are all a lot wordier/noisier than `(⋯)`, `\1`, and `$1`. – tchrist Apr 28 '11 at 01:24