Merge two regexes with variable number of capture groups

Question

I'm trying to match either

(\S+)(=)([fisuo])

or

(\S+)(!)

And then have the results placed in a list (capture groups). All of my attempts result in extra, unwanted captures.

Here's some code:

#!/usr/bin/perl
#-*- cperl -*-
# $Id: test7,v 1.1 2023/04/10 02:57:12 bennett Exp bennett $
#

use strict;
use warnings;
use Data::Dumper;

foreach my $k ('debugFlags=s', 'verbose!') {
    my @v;

    # Below is the offensive looking code.  I was hoping for a regex
    # which would behave like this:

    if(@v = $k =~ m/^(\S+)(=)([fisuo])$/) {
      printf STDERR ("clownMatch = '$k' => %s\n\n", Dumper(\@v));
    } elsif(@v = $k =~ m/^(\S+)(!)$/) {
      printf STDERR ("clownMatch = '$k' => %s\n\n", Dumper(\@v));
    }

    @v = ();

    # This is one of my failed, aspirational matches.  I think I know
    # WHY it fails, but I don't know how to fix it.
    
    if(@v = $k =~ m/^(?:(\S+)(=)([fisuo]))|(?:(\S+)(!))$/) {
      printf STDERR ("hopefulMatch = '$k' => %s\n\n", Dumper(\@v));
    }
    printf STDERR "===\n";
}

exit(0);
__END__

Output:

clownMatch = 'debugFlags=s' => $VAR1 = [
          'debugFlags',
          '=',
          's'
        ];


hopefulMatch = 'debugFlags=s' => $VAR1 = [
          'debugFlags',
          '=',
          's',
          undef,
          undef
        ];


===
clownMatch = 'verbose!' => $VAR1 = [
          'verbose',
          '!'
        ];


hopefulMatch = 'verbose!' => $VAR1 = [
          undef,
          undef,
          undef,
          'verbose',
          '!'
        ];


===

There are more details in the code comments. The output is at the bottom of the code section. And the '!' character is just that. I'm not confusing it with some other not.

Update Mon Apr 10 23:15:40 PDT 2023:

With the wise input of several readers, it seems that this question decomposes into a few smaller questions.

Can a regex return a variable number of capture groups?

I haven't heard one way or the other.

Should one use a regex in this way, if it could?

Not without a compelling reason.

For my purposes, should I use a regex to create what is really a lexical-analyzer/parser?

No. I was using a regex for syntax checking and got carried away.

I learned a good deal, though. I hope moderators see fit to keep this post as a cautionary tale.

Everyone deserves points on this one, and can claim that they were robbed, citing this paragraph. @Schwern gets the points for being first. Thanks.

One has three captures, one has two captures. How should they be combined? — Schwern, Apr 10 '23 at 03:38
I was hoping for `@v` to be either length 2 or 3 depending on which sub-regex matched, like the clownMatch examples in the output. — Erik Bennett, Apr 10 '23 at 03:41
Since you're matching two different things, it seems perfectly reasonable to have two different matches. Why do you want to combine them? — Schwern, Apr 10 '23 at 03:42
I don't know. I (mis)thought that it was the regex-ish thing to do. If it doesn't look like the clownWare I thought it did, I can live with that. It just *looks*... like it should be collapsible. Plus, I want to know how to do it. If it were just a matter of getting it to work, I'd have moved on hours ago. — Erik Bennett, Apr 10 '23 at 03:46
I changed the title to reflect the "variable number of capture groups". I was thinking of the regex like a `sub` in that it could return a variable length list. This may not be the case. — Erik Bennett, Apr 10 '23 at 04:17
Using a [branch reset](https://www.regular-expressions.info/branchreset.html) w/o *undef.* try e.g. [`^(\S+)(?|(=)([fisuo])|(!)())$`](https://regex101.com/r/V8CVgu/2) — bobble bubble, Apr 10 '23 at 08:09
"_Using a branch reset w/o undef_" -- I don't see how that improves the matter; there is still one extra capture — zdim, Apr 10 '23 at 17:47
"_Should one use a regex in this way_" -- sure, why not; just filter out `undef`s if you want, as I showed. We do it like that every day. It's not just regex, using code with it is OK :) — zdim, Apr 11 '23 at 08:03
More to the point, I added a regex parser to my answer. At this point I don't think that there is a simpler way to get what you ask for. Have a look. — zdim, Apr 11 '23 at 08:03
I think your conclusions in the updates (Mon Apr 10) are wrong, but I don't know how to efficiently react on them here. SO may not be the right place for such a discussion. — LanX, Apr 11 '23 at 15:34
@LanX (and anyone else): I'd be willing to believe that. Where do we go from here? — Erik Bennett, Apr 11 '23 at 19:39
@zdim Your updates have expanded my horizons. I'll need some time to read up on your answer. I've got some reading to do. — Erik Bennett, Apr 11 '23 at 19:47
For deep discussions I prefer www.perlmonks.org. SO is limited to this question/answer structure. — LanX, Apr 11 '23 at 20:57
@ErikBennett "_...horizons ... read up_" The added parts lead to interesting stuff (grammars) and yes perhaps there'd be some reading and playing to do. But the one takeaway is -- combine regex and code. Pure-regex solutions are nice to have of course but there's no need to push it. Especially in Perl, where regex is so well integrated into the language that one can even do it all in a single statement (like `my @done = map {...} grep {...} /pattern/;`). That can be pushed too far though, as well :) — zdim, Apr 11 '23 at 23:42

zdim · Answer 1 · 2023-06-30T17:44:16.427

In an alternation the values for all captures are returned, even for those that weren't matched.

An easy way out is to filter out undef's from the return list

if ( my @v = grep { defined } $s =~ /^(?: (\S+)(=)([fisuo]) | (\S+)(!) )$/x )

There are other ways to build the regex as well but a straight-up alternation is just fine.
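To see the filtering in action, here is a minimal runnable sketch. The helper name `option_parts` and the third input `bogus` are made up for illustration; the pattern is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns only the
# captures of the branch that matched, or an empty list when nothing matches.
sub option_parts {
    my ($spec) = @_;
    return grep { defined }
        $spec =~ /^(?: (\S+)(=)([fisuo]) | (\S+)(!) )$/x;
}

for my $spec ('debugFlags=s', 'verbose!', 'bogus') {
    my @v = option_parts($spec);
    printf "%-14s => %d capture(s): %s\n",
        $spec, scalar @v, join ' ', map { sprintf "'%s'", $_ } @v;
}
```

Note that the anchors sit outside the non-capturing group here, so `^` and `$` apply to both branches.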

The question specifically asks how to conflate two (alternative) regex patterns into one in such a way as to get captures only for what is actually matched, without extra undefs. This is a good question in my opinion, as it would often be nice not to have to clean up.

The usual alternation (p1 | p2) returns (in a list context or in @{^CAPTURE}) all indicated capture groups, as stated above. If p1 defines three capture groups and p2 two, in the end we get five; captures for the branch that matched and undefs for the other.

In short, I find that to get a "clean" set of true captures only, with a pure regex, we need to parse with a grammar.^† While the builtin support (see DEFINE) can only match ("recognize") patterns, the Regexp::Grammars module supports far more. A simple example suffices here:

use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);  # Data::Dumper is in the core

my $grammar = do {
    use Regexp::Grammars;
    qr{ 
        <word> <symb> <val>?

        <nocontext:>
        <token: word>  [^=!]+  # or use \w in a character class with chars
                               # that are also allowed, like [\w.-] etc

        <token: symb>  = | !
        <token: val>   [fisuo]
    }x;
};

for my $s (qw(debugFlags=s verb!)) {
    if ($s =~ $grammar) { 
        dd \%/;              # hash %/ is populated with results
        say for values %/;   # just the results
        say '-'x60;
    }   
}

This prints

{ symb => "=", val => "s", word => "debugFlags" }
s
=
debugFlags
------------------------------------------------------------
{ symb => "!", word => "verb" }
!
verb
------------------------------------------------------------

The results aren't sorted so one may want to add a desired sorting criterion for the hash, or go through the individual hash elements.

The example in the question is very simple so a trivial grammar works for it, but if we imagine it growing to process options more comprehensively then the grammar would need to be more involved/structured. For example, while this is still simple

qr{
    <option>   # run the matching

    # Define the grammar
    <nocontext:>
    <token: option>     <opt_vals> | <opt_flag>

    <token: opt_vals>   <word> <symb_vals> <val>
    <token: opt_flag>   <word> <symb_flag>?

    <token: word>       [^=!:]+

    <token: symb_vals>  = | :
    <token: symb_flag>  !
    <token: val>        [fisuo]

}x;

it can be expanded more easily and it is more precise.

The aim of regex in this question is to check usage of Getopt::Long, a module for parsing command-line options, and there can be nothing following ! (negation for flag-type options). So symbols following names of options with values (= and :) are separated from !. There is of course a lot more in the library's syntax; this is a demo.

Please see the (seemingly endless) docs for the many, many Regexp::Grammars features, of which practically none are used here.

^† All else seems to suffer from extra undefs. The "branch reset" comes close but still returns the longest set of indicated capture groups (3 here) even when it matches a shorter branch, as I mentioned in the comment below; so we get undefs again. See the answer from @LanX for how to use this.

The conditional expression, for which I hoped that it might dodge the bullet, also sets up all capturing parentheses that it sees

for (qw(debugFlags=s verb!)) 
{
    if ( /^([^=!]+) (?(?==) (=)([fisuo]) | (!))$/x ) {
        say "Length: ", scalar @{^CAPTURE};
        say $_//'undef' for @{^CAPTURE};
    }
}

We duly get two undef printed in the second test. I use a lookahead for the condition specifically to try to avoid extra capture groups but of course all parens further in the expression get them, regardless of which ones match. So one can really do

    if ( /^([^=!]+) (=)? (?(2) ([fisuo]) | (!))$/x )

(with same results, good match and capture but with extra undefs)

That's one way to do it. I was hoping to learn more about the regex way. I upvoted this anyway, though, because I hadn't used `grep` in that way before, and I love to learn new things. — Erik Bennett, Apr 10 '23 at 03:57
@ErikBennett "_the regex way_" -- I can't think of a generic and straightforward way which won't return all introduced captured groups, even those that are `undef` as they didn't match (in another branch). The [branch reset pattern](https://perldoc.perl.org/perlre#(?%7Cpattern)) comes close but still returns the _longer_ capture group (3 here), even when it matches the shorter branch. (So if it matches `verbose!` it still returns three-long list of captures, one being `undef`). Then there are many ways to craft a regex to avoid alternation but then that depends on the particular pattern. — zdim, Apr 10 '23 at 04:19
This is going to take me some time to study. From the looks of it, it may catch `foo!s`, but I really need to read up on this all. I'll keep watching and asking. This is turning out to be a bigger deal than the one liner I was expecting. I love it. — Erik Bennett, Apr 11 '23 at 20:01
It might catch `bar=`, as well. But this should just be a matter of simple changes. Dang, I haven't needed this stuff since school. At the risk of dating myself, that was before the "Camel Book". — Erik Bennett, Apr 11 '23 at 20:16
@ErikBennett The actual regex depends on the exact use case and is a matter of choice, yes. With `\S+` you catch _everything_ and rely on backtracking with what follows (`=` etc). With `\w+` it's missing `-` for sure, with `[\w-]+` perhaps some other character which can be expected in an option name. A nice choice is to list what _cannot_ be there, which is what I did -- but then that probably needs tweaking. (What other symbols can be used after the name? There is `:` for instance, so that should be added, for `[^=!:]+`. Etc) But that's the easy part, to craft the exact subpattern — zdim, Apr 11 '23 at 23:18
@ErikBennett Another point -- in both grammars I lumped symbols together, in one pattern, `=|!`, that's used for both cases. But you can change it so that some names (like `verb`) are paired with one set of symbols (`!`) and others (like `debug`) with others (`=`); then one can come up with more precise parsing (` ` etc, while `p2` has its own, etc) — zdim, Apr 11 '23 at 23:48
@ErikBennett Thinking of the good point you make (`!` cannot be followed by anything while `=` must be) I tweaked the second grammar a little, to fix that. Note how easy that is. With a direct regex it would be much messier (to capture a symbol but not allow anything to follow the ! symbol in particular, etc). Still just a basic demo... — zdim, Apr 12 '23 at 07:19

score 3 · Answer 2 · answered Apr 10 '23 at 03:42

We can use the following single regex pattern:

^(\S+)([!=])((?<==)[fisuo])?$

This says to match:

^ from the start of the string
(\S+) match and capture in $1 a non whitespace term
([!=]) match and capture in $2 either ! or =
((?<==)[fisuo])? then optionally capture in $3 a letter from fisuo; the lookbehind (?<==) ensures this only matches after =
$ end of the string
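As a quick check of the breakdown above, the following sketch runs the pattern against a few inputs. The helper name `spec_captures` and the two extra inputs (`verbose!s`, `debugFlags=`) are only illustrative assumptions, not part of the question.

```perl
use strict;
use warnings;

my $re = qr/^(\S+)([!=])((?<==)[fisuo])?$/;

# Hypothetical helper: returns the raw capture list (empty on no match).
sub spec_captures { my @v = $_[0] =~ $re; return @v }

for my $spec ('debugFlags=s', 'verbose!', 'verbose!s', 'debugFlags=') {
    my @v = spec_captures($spec);
    if (@v) {
        print "$spec => ",
            join(', ', map { defined($_) ? "'$_'" : 'undef' } @v), "\n";
    }
    else {
        print "$spec => no match\n";
    }
}
```

The list always has three slots when the pattern matches, so `verbose!` comes back with a trailing undef. Because the third group is optional, this pattern also accepts `debugFlags=` with no type letter, which may or may not be desirable.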

answered Apr 10 '23 at 03:42

Tim Biegeleisen


Lookbehind! I was messing with lookaheads, but didn't want to publish my failed attempts. This still leaves a trailing `undef` on the 2 group ex. (`verbose!`), but it's certainly going to work. For yuks, is there any way to have it (or any regex) return 2 or 3 (or variable number) groups? – Erik Bennett Apr 10 '23 at 03:54
@ErikBennett If you want an empty string instead of *undef* with this, one idea is to make the pattern inside optional, e.g. by [removing the `?` and adding *OR nothing* `|)`](https://regex101.com/r/5UjLJq/1) at the end. – bobble bubble Apr 10 '23 at 09:00
@bobblebubble "_If you want an empty string instead of undef..._" -- but that still doesn't give what the question asks, a list of actual captures from the branch that matched (either 2 or 3 in this case). Not undef's or bogus empty strings. This just doesn't answer that. – zdim Apr 10 '23 at 09:18

LanX · Answer 3 · 2023-04-10T14:04:44.580

3

All of my attempts result in extra, unwanted captures.

I'd go for the "branch reset" (?| pattern1 | pattern2 | ... ) like already suggested by @bobble_bubble (as comment only)

It's a generic solution to combine different patterns with groups, while resetting the capture-count.

Alas, contrary to the docs he linked to, you'll still get undef slots at the end of the LISTs returned for patterns with fewer groups.

But if this really bothers you - personally I would keep them - you can safely filter them out with a grep {defined} like @zdim suggested.

That's safe since undef means a non-match and can't be confused with an empty match "".
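A small sketch of why the filter is safe. The pattern below is deliberately altered for illustration (`[fisuo]*` instead of `[fisuo]`, so that group 3 can match an empty string), and the helper name `kept_captures` is hypothetical.

```perl
use strict;
use warnings;

# Illustrative variant only: [fisuo]* lets the third group match "".
my $re = qr/^([^=!]+)(?|(=)([fisuo]*)|(!))$/;

# Hypothetical helper: captures with undef slots filtered out.
sub kept_captures { return grep { defined } $_[0] =~ $re }

for my $spec ('x=', 'x!') {
    my @v = kept_captures($spec);
    printf "%s => %d kept: %s\n",
        $spec, scalar @v, join ', ', map { sprintf "'%s'", $_ } @v;
}
```

For `x=` the empty-string capture survives the filter (three elements), while for `x!` only the unmatched slot is dropped (two elements).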

Here is the code covering your test cases:

use v5.12.0;
use warnings;
use Data::Dump qw/pp ddx/;
use Test::More;

# https://stackoverflow.com/questions/75974097/merge-two-regexes-with-variable-number-of-capture-groups

my %wanted =
  (
   "debugFlags=s" => ["debugFlags", "=", "s"],
   "verbose!"     => ["verbose", "!"],
  );


while ( my ( $str, $expect) = each %wanted ) {
    my @got =
      $str =~ / (\S+)
                (?|
                    (=) ([fisuo]+)
                |
                    (!)
                )
              /x;

    ddx \@got;                          # with trailing undefs

    @got = grep {defined} @got;         # eliminate undefs

    is_deeply( \@got, $expect, "$str => ". pp(\@got));
}

done_testing();

-->

# branchreset.pl:25: ["debugFlags", "=", "s"]
ok 1 - debugFlags=s => ["debugFlags", "=", "s"]
# branchreset.pl:25: ["verbose", "!", undef]
ok 2 - verbose! => ["verbose", "!"]
1..2

strategic update

But again, I don't see the point in eliminating the undef slots at the end, since you will need to handle the different cases individually anyway.

And one day you might want to add patterns after the branch too. If branch-reset was really skipping the missing groups, that would change the numbering of trailing groups beyond recognition. So from a design perspective that's well done.
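To illustrate the numbering point, here is a sketch with one more group after the branch reset. The trailing `(\@?)` marker and the helper name `reset_captures` are hypothetical additions for the demo, not something the question asked for.

```perl
use strict;
use warnings;

# Hypothetical extension: an optional trailing '@' marker after the branch.
# Groups inside (?|...) share numbers 2 and 3; the group after it is 4.
my $re = qr/^([^=!\s]+)(?|(=)([fisuo])|(!))(\@?)$/;

# Hypothetical helper: returns the raw capture list (empty on no match).
sub reset_captures { my @v = $_[0] =~ $re; return @v }

for my $spec ('debugFlags=s@', 'verbose!') {
    my @v = reset_captures($spec);
    print "$spec => ",
        join(', ', map { defined($_) ? "'$_'" : 'undef' } @v), "\n";
}
```

In both cases the trailing group lands in the fourth slot, so code reading `$4` does not need to know which branch matched.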

edited Apr 10 '23 at 14:04

answered Apr 10 '23 at 12:29

LanX


"_I'd go for the "branch reset" (?| pattern1 | pattern2 | ... ) like already suggested by @bobble_bubble (as comment only)_" -- But, as I stated when I mentioned this in a comment under my answer, it still returns the list of the length of the _longest_ branch, so it decidedly returns `undef` values as well (when the shorter branch matches). This does not answer the question. Btw, this is not contrary to the docs I linked to. – zdim Apr 10 '23 at 17:43
@zdim I addressed this at length, please read the whole answer. And I didn't refer to the docs you linked to. – LanX Apr 10 '23 at 18:01
"_please read the whole answer_" -- alright, did so carefully this time (and I did miss a few bits). Still, I am not sure what this answer aims for: the question asks for how to conflate patterns with different numbers of captures, so that you get back actual captures, no `undef`s. That's precisely the question, and it's a good one; it'd be nice to avoid those `undef`s. This answer is a nice discussion, and with some extras, but it doesn't answer the question. From what I see this simply offers a different approach, perhaps with some benefits, but which suffers the same problem. – zdim Apr 10 '23 at 19:26
Then you "_don't see the point in eliminating `undef`_"... well, maybe that is indeed misplaced, but _that was the question_. Some other answers here seem to just ignore that, and I'll repeat my opinion: it is a good question. I don't see why it's considered irrelevant -- getting those `undef` is a nuisance for which we normally _have to_ do something. It'd be nice to not have to. (As for their comment, that came in reference to my original comment, they indeed link to different docs. My bad. I recommend referring to perldoc, like you do.) – zdim Apr 10 '23 at 19:29
Finally, again -- this is a good post I think. I hope I managed to explain the reason for my original comment here. – zdim Apr 10 '23 at 19:34
1. From what I can see it is exactly answering the question, which is proven by the test suite. – LanX Apr 10 '23 at 19:35
2. It's not clear if the OP just wanted to get rid of the other undefs. – LanX Apr 10 '23 at 19:36
3. Branch-reset deserves a prominent answer rather than being hidden in some comments. Others searching here might want just that. – LanX Apr 10 '23 at 19:38
4. Finally i said "I would go that way" and offered the OP an alternative which is different to all the other answers. You might want to wait till he comes back to offer his opinion. – LanX Apr 10 '23 at 19:39
Branch reset is the essential part, i took care that the grepping can be left out. – LanX Apr 10 '23 at 19:43
(1) "_it is exactly answering the question, which is proven by the test suite_" -- hum? but you just filter it like another answer (mine) does, and what you (nicely) quote. So that's not a new answer, that repeats an existing answer. (2) "_not clear if the OP just wanted get rid of the other undefs_" See title. They even edited to make it clearer. Read the Q again. (3) "_Branch-reset is worth to have a prominent answer_" Sure, but I don't see why this question warrants it more than any other alternation? (4) "_I said "I would go that way"_" I don't find that clear, thus my comment. – zdim Apr 10 '23 at 19:44

Schwern · Accepted Answer · 2023-04-10T04:00:44.223

2

Since you're matching two different things, it seems perfectly reasonable to have two different matches.

But, if you do want to combine them, you can do this:

m{^
  (\S+)
  (?:
    =([fisuo]) |
    (!)
  )
  $
}x

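$1 is the name. $2 is the type letter following =, if present. $3 is the !, if present. Note that the = itself is not captured, and whichever of $2 and $3 did not match is undef.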

For anything more complicated, use named captures or Regexp::Assemble.
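As a rough sketch of the named-capture route: the `%+` hash lists only the named groups that actually captured something, so copying it gives a result with no undef entries. The group names (`name`, `type`, `negatable`) and the helper `parse_option` are made up for this example.

```perl
use strict;
use warnings;

# Same shape as the pattern above, with hypothetical group names.
my $re = qr/^(?<name>\S+)(?:=(?<type>[fisuo])|(?<negatable>!))$/;

# Hypothetical helper: returns a hashref of the named captures that
# matched, or nothing if the spec does not match at all.
sub parse_option {
    my ($spec) = @_;
    return unless $spec =~ $re;
    return +{ %+ };    # copy; %+ holds only groups with defined values
}

for my $spec ('debugFlags=s', 'verbose!') {
    my $opt = parse_option($spec) or next;
    print "$spec => ",
        join(', ', map { "$_='$opt->{$_}'" } sort keys %$opt), "\n";
}
```

This gives two keys for `verbose!` and two for `debugFlags=s` (the `=` is not captured in this pattern), at the cost of returning a hash rather than a positional list.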

edited Apr 10 '23 at 04:00

answered Apr 10 '23 at 03:47

Schwern

