10

In PCRE2 or any other regex engine supporting forward backreferences, is it possible to change a capture group that matched in a previous iteration of a loop into a non-participating capture group (also known as an unset capture group or non-captured group), causing conditionals that test that group to match with their "false" clause rather than their "true" clause?

For example, take the following PCRE regex:

^(?:(z)?(?(1)aa|a)){2}

When fed the string zaazaa, it matches the whole string, as desired. But when fed zaaaa, I would like it to match zaaa; instead, it matches zaaaa, the whole string. (This is just for illustration. Of course this example could be handled by ^(?:zaa|a){2} but that is beside the point. Practical usage of capture group erasure would tend to be in loops that most often do far more than 2 iterations.)

An alternative way of doing this, which also doesn't work as desired:

^(?:(?:z()|())(?:\1aa|\2a)){2}

Note that both of these work as desired when the loop is "unrolled", because they no longer have to erase a capture that has already been made:

^(?:(z)?(?(1)aa|a))(?:(z)?(?(2)aa|a))
^(?:(?:z()|())(?:\1aa|\2a))(?:(?:z()|())(?:\3aa|\4a))

So instead of being able to use the simplest form of conditional, a more complicated one must be used, which only works in this example because the "true" match of z is non-empty:

^(?:(z?)(?(?!.*$\1)aa|a)){2}

Or just using an emulated conditional:

^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}
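As a quick sanity check of the emulated conditional, here is a minimal sketch in Python. Python's `re` module is only an assumed test harness here (the question itself is about PCRE2), and the names `EMULATED` and `matched_prefix` are illustrative. The pattern uses only backward references and lookaheads, and group 1 is re-captured on every iteration, so the result should not depend on how an engine treats stale captures.

```python
import re

# Emulated conditional from above: group 1 is captured on every iteration,
# possibly as the empty string. On a single-line subject, (?!.*$\1) succeeds
# only when \1 is non-empty, because an empty \1 always matches at the end.
EMULATED = re.compile(r'^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}')

def matched_prefix(s):
    """Return the text matched at the start of s, or None if there is no match."""
    m = EMULATED.match(s)
    return m.group(0) if m else None

print(matched_prefix("zaazaa"))  # zaazaa
print(matched_prefix("zaaaa"))   # zaaa
```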

I have scoured all the documentation I can find, and there seems not to even be any mention or explicit description of this behavior (that captures made within a loop persist through iterations of that loop even when they fail to be re-captured).

It's different than what I intuitively expected. The way I would implement it is that evaluating a capture group with 0 repetitions would erase/unset it (so this could happen to any capture group with a *, ?, or {0,N} quantifier), but skipping it due to being in a parallel alternative within the same group in which it gained a capture during a previous iteration would not erase it. Thus, this regex would still match words iff they contain at least one of every vowel:

\b(?:a()|e()|i()|o()|u()|\w)++\1\2\3\4\5\b

But skipping a capture group due to it being inside an unevaluated alternative of a group that is evaluated with nonzero repetitions which is nested within the group in which the capture group took on a value during a previous iteration would erase/unset it, so this regex would be able to either capture or erase group \1 on every iteration of the loop:

^(?:(?=a|(b)).(?(1)_))*$

and would match strings such as aaab_ab_b_aaaab_ab_aab_b_b_aaa. However, the way forward references are actually implemented in existing engines, it matches aaaaab_a_b_a_a_b_b_a_b_b_b_.

I would like to know the answer to this question not merely because it would be useful in constructing regexes, but because I have written my own regex engine, currently ECMAScript-compatible with some optional extensions (including molecular lookahead (?*), i.e. non-atomic lookahead, which as far as I know, no other engine has), and I would like to continue adding features from other engines, including forward/nested backreferences. Not only do I want my implementation of forward backreferences to be compatible with existing implementations, but if there isn't a way of erasing capture groups in other engines, I will probably create a way of doing it in my engine that doesn't conflict with other existing regex features.

To be clear: An answer stating that this is not possible in any mainstream engines will be acceptable, as long as it is backed up by adequate research and/or citing of sources. An answer stating that it is possible would be much easier to state, since it would require only one example.

Some information on what a non-participating capture group is:
http://blog.stevenlevithan.com/archives/npcg-javascript - this is the article that originally introduced me to the idea.
https://www.regular-expressions.info/backref2.html - the first section on this page gives a brief explanation.
In ECMAScript/JavaScript regexes, backreferences to NPCGs always match (making a zero-length match). In pretty much every other regex flavor, they fail to match anything.
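To make the non-ECMAScript behavior concrete, here is a small sketch using Python's `re` module as one example of a flavor in which a backreference to a non-participating group fails. The pattern and the name `NPCG` are made up for illustration.

```python
import re

# On the subject "b", (a)? matches zero times, so group 1 never participates.
NPCG = re.compile(r'^(a)?\1b$')

print(NPCG.match("aab") is not None)  # True: \1 repeats the captured "a"
print(NPCG.match("b") is not None)    # False: \1 refers to an unset group and fails
```

An ECMAScript engine would accept `"b"` with the equivalent pattern, because there `\1` matches the empty string when group 1 is unset.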

oguz ismail
Deadcode
  • I believe `\K` will tell the regex engine to clear all capture groups, but I don't understand what you are trying to do here. – Tim Biegeleisen Jan 04 '19 at 23:51
  • The only mistake in the first regex of the question was that you were asking it to capture the first group two times, which was aa. So I removed it. Let the whole group capture, and then let it repeat if you want, or match at least one time. – Deep Jan 04 '19 at 23:54
  • @Deep Thanks but you did misunderstand my question. The example I gave was just a toy example. I want to be able to erase capture groups while staying inside a loop and continuing to loop. I only gave it `{2}` repetitions to make it a very simple example; in practice, I'd mostly be using this on unbounded loops like `(...)+` and `(...)*` where `...` means whatever would go inside the loop. – Deadcode Jan 05 '19 at 00:10
  • Can you put an example string somewhere so we can play with the data? It would be easier for me to understand. – Deep Jan 05 '19 at 00:16
  • @Deep I'll try, but it's not any particular example task that matters in this case, it's the *way* it's done. There is no task that *requires* this, it's just that being able to erase a capture could make certain tasks doable in a more elegant way. – Deadcode Jan 05 '19 at 01:01
  • @Tim Biegeleisen `\K` just changes where the final match begins, and does not affect the contents of capture groups at all. I don't actually care about the final match in this example; it's only an example to differentiate and demonstrate/explain what I want to do inside the loop. I want to erase the capture group during a loop, while staying in the loop. – Deadcode Jan 05 '19 at 05:24

4 Answers

5

I found this documented in PCRE's man page, under "DIFFERENCES BETWEEN PCRE2 AND PERL":

   12.  There are some differences that are concerned with the settings of
   captured strings when part of  a  pattern  is  repeated.  For  example,
   matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
   unset, but in PCRE2 it is set to "b".

I'm struggling to think of a practical problem that cannot be better solved with an alternative solution, but in the interests of keeping it simple, here goes:

Suppose you have a simple task well-suited to being solved by using forward references; for example, checking that the input string is a palindrome. This cannot be solved generally with recursion (due to the atomic nature of subroutine calls), and so we bang out the following:

/^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$/

Easy enough. Now suppose we are asked to verify that every line in the input is a palindrome. Let's try to solve this by placing the expression in a repeated group:

\A(?:^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$(?:\n|\z))+\z

Clearly that doesn't work, since the value of \2 persists from the first line to the next. This is similar to the problem you're facing, and so here are a number of ways to overcome it:

1. Enclose the entire subexpression in (?!(?! )):

\A(?:(?!(?!^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$)).+(?:\n|\z))+\z

Very easy, just shove 'em in there and you're essentially good to go. Not a great solution if you want any particular captured values to persist.

2. Branch reset group to reset the value of capture groups:

\A(?|^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$|\n()()|\z)+\z

With this technique, you can reset the value of capture groups from the first (\1 in this case) up to a certain one (\2 here). If you need to keep \1's value but wipe \2, this technique will not work.

3. Introduce a group that captures the remainder of the string from a certain position to help you later identify where you are:

\A(?:^(?:(.)(?=.*(\1(?(2)(?=\2\3\z)\2))([\s\S]*)))*+.?\2$(?:\n|\z))+\z 

The whole rest of the collection of lines is saved in \3, allowing you to reliably check whether you have progressed to the next line (when (?=\2\3\z) is no longer true).

This is one of my favourite techniques because it can be used to solve tasks that seem impossible, such as the ol' matching nested brackets using forward references. With it, you can maintain any other capture information you need. The only downside is that it's horribly inefficient, especially for long subjects.

4. This doesn't really answer the question, but it solves the problem:

\A(?![\s\S]*^(?!(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$))

This is the alternative solution I was talking about. Basically, "re-write the pattern" :) Sometimes it's possible, sometimes it isn't.

jaytea
  • As a side note, the difference part is not specific to PCRE2. – revo Jan 15 '19 at 09:44
  • +1, Very nice answer (not technically an answer but still useful), with a great example of a problem that would benefit from capture group erasing, and at least one method of working around the issue I hadn't thought of. An interesting thing to know about Perl: `/^(?:(z)?(?(1)aa|a)){2}/` actually works the way I want in it (differently from PCRE). However, the alternative version, `/^(?:(?:(z)|)(?(1)aa|a)){2}/`, works the same in Perl and PCRE (which is not the way I want). – Deadcode Jan 15 '19 at 09:52
  • Note that the enclosing-in-`(?!`...`)` trick often doesn't work in Perl. It lets capture groups' contents leak out of a negative lookahead in some circumstances. – Deadcode Aug 04 '22 at 06:12
5

With PCRE (and, as far as I'm aware, every other engine), it's not possible to unset a capturing group. However, subroutine calls do not retain capture values from a previous call, so you can use one to accomplish the same task:

(?(DEFINE)((z)?(?(2)aa|a)))^(?1){2}


If you are going to implement a way to unset a capturing group in your own regex flavor, I'd strongly suggest not letting it happen automatically. Provide a flag for it instead.

revo
  • This is indeed a viable method in some cases, but the downside is that you also can't return any captures from a subroutine call in PCRE. The choice is either between capturing the entirety of what the subroutine matches, or not capturing it. – Deadcode Jan 18 '19 at 20:01
3

This is partially possible in .NET's flavour of regex.

The first thing to note is that .NET records all of the captures for a given capture group, not just the latest. For instance, ^(?=(.)*) records each character in the first line as a separate capture in the group.

To actually delete captures, .NET regex has a construction known as balancing groups. The full format of this construction is (?<name1-name2>subexpression).

  • First, name2 must have previously been captured.
  • The subexpression must then match.
  • If name1 is present, the substring between the end of the capture of name2 and the start of the subexpression match is captured into name1.
  • The latest capture of name2 is then deleted. (This means that the old value could be backreferenced in the subexpression.)
  • The match is advanced to the end of the subexpression.

If you know you have name2 captured exactly once then it can readily be deleted using (?<-name2>); if you don't know whether you have name2 captured then you could use (?>(?<-name2>)?) or a conditional. The problem arises if you might have name2 captured more than once since then it depends on whether you can organise enough repetitions of the deletion of name2. ((?<-name2>)* doesn't work because * is equivalent to ? for zero-length matches.)
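The capture bookkeeping described above can be pictured as one stack of spans per group name. The following is a toy model in Python of that description only; it is not .NET's implementation or API, and the class `CaptureStacks` and its methods are hypothetical names introduced for illustration.

```python
from collections import defaultdict

class CaptureStacks:
    """Toy model: each group name maps to a stack of (start, end) spans."""

    def __init__(self, text):
        self.text = text
        self.stacks = defaultdict(list)

    def capture(self, name, start, end):
        # An ordinary capture pushes a new span; older captures are kept.
        self.stacks[name].append((start, end))

    def balance(self, name2, sub_start, name1=None):
        """Model (?<name1-name2>subexpression) after the subexpression
        has matched starting at index sub_start."""
        if not self.stacks[name2]:
            return False  # name2 was never captured, so the construct fails
        _, prev_end = self.stacks[name2].pop()  # delete the latest capture of name2
        if name1 is not None:
            # Text between the end of name2's capture and the subexpression start.
            self.capture(name1, prev_end, sub_start)
        return True

    def value(self, name):
        """Latest captured text for name, or None if the group is unset."""
        if not self.stacks[name]:
            return None
        start, end = self.stacks[name][-1]
        return self.text[start:end]

text = "(abc)"
groups = CaptureStacks(text)
groups.capture("open", 0, 1)                             # "(" goes into group "open"
ok = groups.balance("open", sub_start=4, name1="inner")  # ")" matches at index 4
print(ok, groups.value("inner"), groups.value("open"))   # True abc None
```

In this model, `(?<-name2>)` corresponds to calling `balance` with no `name1`: it pops one capture per successful application, which is why a group captured several times needs several deletions to become unset.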

Neil
0

There is also another way to "erase" capture groups in .NET. Unlike the (?<-name>) method, this empties the group instead of deleting it – so instead of not matching, it will then match an empty string.

In .NET, groups with the same name can be captured multiple times, even if that name is a number. This allows PCRE expressions using branch reset groups to be ported to .NET. Consider this PCRE pattern:

(?|(pattern)|())

Assuming both groups are \1 above, then using this technique, in .NET it would become:

(?:(pattern)|(?<1>))

I used this technique today to make a 38 byte .NET regex that matches strings whose length is a fourth power:

^((?=(?>^((?<3>\3|x))|\3(\3\2))*$)){2}

The above is a port of the following 35 byte PCRE regex, which uses a branch reset group:

^((?=(?|^((\2|x))|\2(\2\3))*+$)){2}

(In this example, the capture group isn't actually being emptied. But this technique can be used to do anything a branch reset group can do, including emptying a group.)

Deadcode