9

Are optional non-capturing groups redundant?

Is the following regex:

(?:wo)?men

semantically equivalent to the following regex?

(wo)?men
Wiktor Stribiżew
fredoverflow
  • I think this would depend on where you are using the regex. Java's standard regex strings might require it, whilst I am fairly sure Perl's would consider it redundant. – thecoshman Jul 19 '15 at 10:59
  • 5
    Non-capturing groups are heavy on the processor (since they require extra processing), while capturing groups are heavy on memory (since they have to store many things). But they're semantically equivalent in the sense that they can match the same thing(s), just in a different way. You can think of them as cars with different engines: both serve as a means of transport. – Muhammad Imran Jul 19 '15 at 15:30
  • 1
    Hi, if you think the answer below works for you, please consider accepting. – Wiktor Stribiżew May 06 '19 at 07:29
  • 2
    @MuhammadImran Could you provide a reference for the claim that *non-capturing groups are heavy on processor*? – Christopher Schultz Jun 19 '20 at 10:10
  • If my answer below worked for you please consider accepting the answer. – Wiktor Stribiżew Oct 27 '21 at 11:28

2 Answers

12

Your (?:wo)?men and (wo)?men are semantically equivalent in that they match the same strings, but they are technically different: the first uses a non-capturing group and the second a capturing group, which also stores the text it matched. Thus, the question is: why use non-capturing groups when we have capturing ones?
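The difference can be checked directly. The following is only a minimal sketch using Python's re module; the word list and variable names are made up for illustration and are not part of the question.

```python
import re

non_capturing = re.compile(r'(?:wo)?men')
capturing = re.compile(r'(wo)?men')

# Illustrative sample words (an assumption, not from the question).
words = ['men', 'women', 'omen', 'wowomen']

# Both patterns accept and reject exactly the same whole strings.
same_matches = all(
    bool(non_capturing.fullmatch(w)) == bool(capturing.fullmatch(w))
    for w in words
)

# Only the capturing version records what the group matched.
nc_groups = non_capturing.fullmatch('women').groups()
c_groups = capturing.fullmatch('women').groups()
c_groups_men = capturing.fullmatch('men').groups()

print(same_matches)   # True
print(nc_groups)      # ()
print(c_groups)       # ('wo',)
print(c_groups_men)   # (None,)
```

So the set of matched strings is the same; what differs is whether a numbered group is available afterwards.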

Non-capturing groups are helpful in several situations:

  1. To avoid an excessive number of backreferences (remember that it is sometimes difficult to use backreferences higher than 9).
  2. To stay below the limit of 99 numbered backreferences that some flavors impose, by reducing the number of numbered capturing groups (source: Regular-expressions.info: "Most regex flavors support up to 99 capturing groups and double-digit backreferences.").
    NOTE: this limit does not apply to the Java regex engine, nor to the PHP or .NET regex engines.
  3. To lessen the overhead of storing the captures.
  4. To add more groupings to an existing regex without changing the numbering of its capturing groups.
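
As a rough illustration of the last point, here is a small Python sketch. The date pattern, the "released" prefix and the variable names are hypothetical and chosen only for the example.

```python
import re

# Hypothetical input and patterns, for illustration only.
text = 'released 2015-07'

# Existing regex: group 1 is the year, group 2 is the month.
original = re.compile(r'(\d{4})-(\d{2})')

# New optional prefix added with a non-capturing group:
# the year is still group 1.
extended_nc = re.compile(r'(?:released\s+)?(\d{4})-(\d{2})')

# Same prefix added with a capturing group:
# every existing group number shifts by one.
extended_c = re.compile(r'(released\s+)?(\d{4})-(\d{2})')

year_original = original.search(text).group(1)
year_nc = extended_nc.search(text).group(1)
shifted = extended_c.search(text).group(1)

print(year_original)  # 2015
print(year_nc)        # 2015
print(repr(shifted))  # 'released '
```

Code that reads group(1) keeps working with the non-capturing variant, while the capturing variant would silently return the prefix instead of the year.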

Also, it simply makes our matches cleaner:

You can use a non-capturing group to retain the organisational or grouping benefits but without the overhead of capturing.

It does not seem a good idea to refactor existing regular expressions just to convert capturing groups to non-capturing ones, since that may break code that relies on the group numbers, or require too much effort.

Wiktor Stribiżew
  • 2
    Just note that the 99 backreference limit does not pertain to the Java regex engine. The number of capturing groups in Java is stored in *transient int capturingGroupCount*, so, theoretically, there can be a lot of backreferences, but their number can be capped by memory limitations. – Wiktor Stribiżew Jul 18 '16 at 15:12
  • I'm trying to find how much overhead we are talking about and what impact this actually has (Java and Javascript). Is there any real benefit in using non-capturing groups in terms of performance? – runlevel0 Aug 31 '17 at 14:34
  • 1
    @runlevel0: I did not measure that, but have seen some comment where people claim there is some minute discrepancy in favor of non-capturing groups. However, there are situations when capturing groups are abused inside quantified groups inside a pattern (not in the final pattern position). In those cases, the regex engine works hard on trying to resize/reset the captured value each time backtracking goes into the repeated capturing group, and in case the text chunk the repeated capturing group matches is huge, performance issues might already become noticeable. – Wiktor Stribiżew Aug 31 '17 at 14:52
  • 1
    See [this PHP answer](https://stackoverflow.com/a/39688067/3832970). There have been reported similar issues with C++ `std::regex`. – Wiktor Stribiżew Aug 31 '17 at 14:53
  • 1
    @runlevel0 For JavaScript there is generally a minor performance improvement for non-capturing groups: https://jsperf.com/regex-capture-vs-non-capture – Paul Wagland Dec 07 '17 at 14:08
0

A question elsewhere was asking the same and I provided an answer with an example in Python:

It doesn't "have the same effect" - in one case the group is captured and accessible, in the other it is only used to complete the match.

People use non-capturing groups when they are not interested in accessing the value of the group: to save space in situations with many matches, but also for better performance in cases where the regex engine is optimised for it.

A useless example in Python to illustrate the point:

from timeit import timeit
import re

chars = 'abcdefghij'
s = ''.join(chars[i % len(chars)] for i in range(100000))


def capturing():
    re.findall('(a(b(c(d(e(f(g(h(i(j))))))))))', s)


def noncapturing():
    re.findall('(?:a(?:b(?:c(?:d(?:e(?:f(?:g(?:h(?:i(j))))))))))', s)


print(timeit(capturing, number=1000))
print(timeit(noncapturing, number=1000))

Output:

5.8383678999998665
1.0528525999998237

Note: this is in spite of PyCharm (if you happen to use it) warning "Unnecessary non-capturing group" - the warning is correct, but not the whole truth, clearly. It's logically unneeded, but definitely does not have the same practical effect.

If the reason you wanted to get rid of them was to suppress such warnings, PyCharm allows you to do so with this:

# noinspection RegExpUnnecessaryNonCapturingGroup
re.findall('(?:a(?:b(?:c(?:d(?:e(?:f(?:g(?:h(?:i(j))))))))))', s)

Another note for the pedantic: the examples above aren't strictly logically equivalent either. But they match the same strings, just with different results.

c = re.findall('(a(b(c(d(e(f(g(h(i(j))))))))))', s)
nc = re.findall('(?:a(?:b(?:c(?:d(?:e(?:f(?:g(?:h(?:i(j))))))))))', s)

c is a list of 10-tuples ([('abcdefghij', 'bcdefghij', ..), ..]), while nc is a list of single strings (['j', ..]).

Grismar