In PCRE2 or any other regex engine supporting forward backreferences, is it possible to change a capture group that matched in a previous iteration of a loop into a non-participating capture group (also known as an unset capture group or non-captured group), causing conditionals that test that group to match with their "false" clause rather than their "true" clause?
For example, take the following PCRE regex:
^(?:(z)?(?(1)aa|a)){2}
When fed the string zaazaa
, it matches the whole string, as desired. But when fed zaaaa
, I would like it to match zaaa
; instead, it matches zaaaa
, the whole string. (This is just for illustration. Of course this example could be handled by ^(?:zaa|a){2}
but that is beside the point. Practical usage of capture group erasure would tend to be in loops that most often do far more than 2 iterations.)
An alternative way of doing this, which also doesn't work as desired:
^(?:(?:z()|())(?:\1aa|\2a)){2}
Note that both of these work as desired when the loop is "unrolled", because they no longer have to erase a capture that has already been made:
^(?:(z)?(?(1)aa|a))(?:(z)?(?(2)aa|a))
^(?:(?:z()|())(?:\1aa|\2a))(?:(?:z()|())(?:\3aa|\4a))
So instead of being able to use the simplest form of conditional, a more complicated one must be used, which only works in this example because the "true" match of z
is non-empty:
^(?:(z?)(?(?!.*$\1)aa|a)){2}
Or just using an emulated conditional:
^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}
I have scoured all the documentation I can find, and there seems not to even be any mention or explicit description of this behavior (that captures made within a loop persist through iterations of that loop even when they fail to be re-captured).
It's different than what I intuitively expected. The way I would implement it is that evaluating a capture group with 0 repetitions would erase/unset it (so this could happen to any capture group with a *
, ?
, or {0,N}
quantifier), but skipping it due to being in a parallel alternative within the same group in which it gained a capture during a previous iteration would not erase it. Thus, this regex would still match words iff they contain at least one of every vowel:
\b(?:a()|e()|i()|o()|u()|\w)++\1\2\3\4\5\b
But skipping a capture group due to it being inside an unevaluated alternative of a group that is evaluated with nonzero repetitions which is nested within the group in which the capture group took on a value during a previous iteration would erase/unset it, so this regex would be able to either capture or erase group \1
on every iteration of the loop:
^(?:(?=a|(b)).(?(1)_))*$
and would match strings such as aaab_ab_b_aaaab_ab_aab_b_b_aaa
. However, the way forward references are actually implemented in existing engines, it matches aaaaab_a_b_a_a_b_b_a_b_b_b_
.
I would like to know the answer to this question not merely because it would be useful in constructing regexes, but because I have written my own regex engine, currently ECMAScript-compatible with some optional extensions (including molecular lookahead (?*)
, i.e. non-atomic lookahead, which as far as I know, no other engine has), and I would like to continue adding features from other engines, including forward/nested backreferences. Not only do I want my implementation of forward backreferences to be compatible with existing implementations, but if there isn't a way of erasing capture groups in other engines, I will probably create a way of doing it in my engine that doesn't conflict with other existing regex features.
To be clear: An answer stating that this is not possible in any mainstream engines will be acceptable, as long as it is backed up by adequate research and/or citing of sources. An answer stating that it is possible would be much easier to state, since it would require only one example.
Some information on what a non-participating capture group is:
http://blog.stevenlevithan.com/archives/npcg-javascript - this is the article that originally introduced me to the idea.
https://www.regular-expressions.info/backref2.html - the first section on this page gives a brief explanation.
In ECMAScript/Javascript regexes, backreferences to NPCGs always match (making a zero-length match). In pretty much every other regex flavor, they fail to match anything.