Perl regular expression isn't greedy enough

Question

I'm writing a regular expression in perl to match perl code that starts the definition of a perl subroutine. Here's my regular expression:

my $regex = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';

$regex matches code that starts a subroutine. I'm also trying to capture the name of the subroutine in $1 and any white space and comments between the subroutine name and the initial open brace in $2. It's $2 that is giving me a problem.

Consider the following perl code:

my $x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    $x = 2;
    return;
}

When I put this perl code into a string and match it against $regex, $2 is "# This is comment 3.\n", not the three lines of comments that I want. I thought the regular expression would greedily put all three lines of comments into $2, but that seems not to be the case.

I would like to understand why $regex isn't working and to design a simple replacement. As the program below shows, I have a more complex replacement ($re3) that works. But I think it's important for me to understand why $regex doesn't work.

use strict;
use English;

my $code_string = <<END_CODE;
my \$x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    \$x = 2;
    return;
}
END_CODE

my $re1 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
my $re2 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{';
my $re3 = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';

print "\$code_string is '$code_string'\n";
if  ($code_string =~ /$re1/) {print "For '$re1', \$2 is '$2'\n";}
if  ($code_string =~ /$re2/) {print "For '$re2', \$2 is '$2'\n";}
if  ($code_string =~ /$re3/) {print "For '$re3', \$2 is '$2'\n";}
exit 0;

__END__

The output of the perl script above is the following:

$code_string is 'my $x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    $x = 2;
    return;
}
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{', $2 is '
# This is comment 1.
# This is comment 2.
# This is comment 3.
'

See also [`PPI`](http://search.cpan.org/perldoc?PPI). e.g., `$subs=PPI::Document->new(\$code_string)->find('PPI::Statement::Sub');...` — mob, Mar 13 '12 at 20:42

Ryan C. Thompson · Accepted Answer · 2017-05-19T20:12:30.317

Look at only the part of your regex that captures $2. It is (\s*#.*\n). By itself, this can only capture a single comment line. You have an asterisk after it in order to match multiple comment lines, and this works just fine: the group matches each comment line in turn and puts it into $2, each time replacing the previous value of $2. So the final value of $2 when the regex is done matching is only the last thing that the capturing group matched, which is the final comment line.

To fix it, you need to put the asterisk inside the capturing group. But then you need another set of parentheses (non-capturing, this time) to make sure the asterisk applies to the whole thing. So instead of (\s*#.*\n)*, you need ((?:\s*#.*\n)*).
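A minimal sketch of the difference, using a shortened input string (the variable names $code, $last and $all are made up for this illustration and are not from the question):

```perl
use strict;
use warnings;

# Shortened stand-in for the question's input: a sub header and three comments.
my $code = "sub zz\n# c1\n# c2\n# c3\n{\n";

# Quantifier outside the capturing group: group 2 is overwritten on every
# repetition, so only the last comment line survives.
my ($name_a, $last) = $code =~ /\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{/;

# Quantifier inside the capturing group (around a non-capturing group):
# group 2 spans all repetitions at once.
my ($name_b, $all) = $code =~ /\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{/;

print "outside: [$last]\n";
print "inside:  [$all]\n";
```

With this input, $last should hold only the "# c3" line, while $all should hold the newline after the sub name followed by all three comment lines.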

Your third regex works because you unwittingly surrounded the whole expression in parentheses so that you could put a question mark after it. This caused $2 to capture all the comments at once, and $3 to capture only the final comment.
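To see both groups of the third regex side by side, something like the following could be used (a sketch on a shortened input; $outer and $inner are illustrative names for what the question would see in $2 and $3):

```perl
use strict;
use warnings;

my $code = "sub zz\n# c1\n# c2\n# c3\n{\n";

# $re3 from the question: group 2 wraps the repeated group 3.
my ($name, $outer, $inner) =
    $code =~ /\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{/;

print "\$2 (outer group): [$outer]\n";   # all comment lines
print "\$3 (inner group): [$inner]\n";   # only the last repetition
```

The inner group still suffers from the overwrite behavior; it is the outer group that happens to collect everything.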

When you are debugging your regex, make sure you print out the values of all the match variables you are using: $1, $2, $3, etc. You would have seen that $1 was just the name of the subroutine and $2 was only the third comment. This might have led you to wonder how on earth your regex skipped over the first two comments when there is nothing between the first and second capturing groups, which would eventually lead you in the direction of discovering what happens when a capturing group matches multiple times.
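One way to dump every capture group while debugging is to run the match in list context, which returns ($1, $2, ...) on success. This is only a sketch; @captures and @report are names invented for the example:

```perl
use strict;
use warnings;

my $code = "sub zz\n# c1\n# c2\n# c3\n{\n";
my $re   = qr/\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{/;

# In list context a successful match returns the list of capture groups.
my @captures = $code =~ $re;

# Build one line per group, marking groups that did not participate.
my @report;
for my $i (0 .. $#captures) {
    my $value = defined $captures[$i] ? $captures[$i] : '(undef)';
    push @report, sprintf('$%d = [%s]', $i + 1, $value);
}
print "$_\n" for @report;
```

Seeing every group printed at once makes it easier to notice that group 2 holds only the last comment line.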

~~By the way, it looks like you are also capturing any whitespace after the subroutine name into $1. Is this intentional?~~ (Oops, I messed up my mnemonics and thought \w was "w for whitespace".)

Thanks. I think you solved the problem. In fact, I was printing the values of $1, $2, ... while debugging. I minimized the test code that I posted here. Concerning $1, the part of the regular expression that matches it is '([a-zA-Z_]\w*)', an alphabetic character or underscore followed by zero or more alphabetic characters, underscores and digits. None of those match white space. I've tested it. — David Levner, Mar 13 '12 at 20:26

score 4 · Answer 2 · answered Mar 13 '12 at 19:59

If you add repetition to a capturing group, it will only capture the final match of that group. This is why $regex only captures the final comment line in $2.

Here is how I would rewrite your regex:

my $regex = '\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{';

This is very similar to your $re3, except for the following changes:

The white space and comment matching portion is now in a non-capturing group.
I changed that portion of the regex from ((...)+)? to ((...)*), which matches the same text.
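The two forms accept the same input, but there is one difference that may matter in practice: when there are no comments at all, the optional group in ((...)+)? does not participate in the match and yields undef, while ((...)*) yields an empty string. A small sketch, with illustrative variable names that are not from the answer:

```perl
use strict;
use warnings;

# A sub header with no comment lines between the name and the brace.
my $no_comments = "sub zz {\n";

# Optional group around a "+" repetition: group 2 is skipped entirely.
my ($n1, $opt_plus) =
    $no_comments =~ /\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)+)?\s*\{/;

# "*" repetition inside the group: group 2 matches the empty string.
my ($n2, $star) =
    $no_comments =~ /\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{/;

print 'optional plus: ', (defined $opt_plus ? "[$opt_plus]" : 'undef'), "\n";
print 'star:          ', (defined $star     ? "[$star]"     : 'undef'), "\n";
```

Under `use warnings`, interpolating the undef value would produce an "uninitialized value" warning, so the star form is a little more convenient when $2 is printed unconditionally.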

Thanks. I see it now. It seems that the additional parentheses are necessary for what I want to do. — David Levner, Mar 13 '12 at 20:38

score 1 · Answer 3 · answered Mar 13 '12 at 19:55

The problem is that by default the \n isn't part of the string. The regex stops matching at \n.

You need to use the s modifier for multi-line matches:

if  ($code_string =~ /$re1/s) {print "For '$re1', \$2 is '$2'\n";}

Note the s after the regex.

Nathan Fellman

This is incorrect, `\n` is a part of the string and the regex does continue to match, otherwise none of the OP's expressions would match. – Andrew Clark Mar 13 '12 at 20:06
Yes, although this regex could be better written using the `s` and possibly `m` modifiers, it matches fine as is without them. This isn't the problem. – Ryan C. Thompson Mar 13 '12 at 20:14
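The point made in these two comments can be checked with a short sketch. The input below is a made-up variation of the question's code with a nested brace on its own line, and the variable names are assumptions for the example:

```perl
use strict;
use warnings;

# Made-up input: two comments, then a body containing another "{" line.
my $code = "sub zz\n# c1\n# c2\n{\n    if (1)\n    {\n    }\n}\n";
my $pat  = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';

# Without /s the pattern still matches, because the newlines are consumed
# by the explicit \n and by \s, not by ".".
my ($name_plain, $c_plain) = $code =~ /$pat/;

# With /s, "." also matches newlines, so one greedy ".*" can run past the
# comments and into the body before backtracking to a later "{".
my ($name_s, $c_s) = $code =~ /$pat/s;

print "without /s: [$c_plain]\n";
print "with /s:    [$c_s]\n";
```

For this input, the match without /s still leaves only the last comment in the group, and the match with /s pulls part of the sub body into it, which suggests /s is not the fix for the original problem.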
