Why does perl regex with /mg modifier match past end-of-line?

Question

This is related to perl multiline regex to separate comments within paragraphs, but focuses exclusively on a single question of regex syntax.

According to perlre: Modifiers, the /m regex modifier means

Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.

Thus, with the following code:

#!/usr/bin/perl
use strict; use warnings;
$/ = ''; # one paragraph at a time
while(<DATA>)
{
    print "original:\n";
    print; 
    s/^([^B]*)(B.*?)$/>$1|$2</mg;
    print "\n\nafter substitution:\n";
    print; 
}

__DATA__
aaaaBaBaBB
bbbbBbadbe
cccc
dddd
eeeeBeeeee
ffff
gggg

I expected the regex engine to behave as follows.

line 1: match, because it finds both patterns between the start and end of this line.

line 2: ditto.

line 3: no match. The 1st regex-group (in the 1st set of parentheses) matches. But when we reach the end of the line, we are still looking for B, to begin the 2nd regex-group. Since we have specified /m, the end of this particular line means we have reached $ without satisfying the entire pattern.

line 4: We start a new line so we encounter a new ^. Again, no match.

line 5: match. Both regex-groups lie between the start and end of the line, i.e., between ^ and $, exactly as specified.

Thus I expect to see

>aaaa|BaBaBB<
>bbbb|Bbadbe<
cccc
dddd
>eeee|Beeeee<
ffff
gggg

Instead, it appears that at line 3 the engine ignores the end of the line and searches past it. It treats lines 3-5 as a single line, which would satisfy the regex only if $ no longer had to mark the end of a line. Here is what we see:

>aaaa|BaBaBB<
>bbbb|Bbadbe<
>cccc
dddd
eeee|Beeeee<
ffff
gggg

How is this consistent with the /m specification? Where is this behavior documented?

> perl --version

This is perl 5, version 18, subversion 4 (v5.18.4) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

Are you sure you need paragraph mode? It seems like you are only using `/m` to compensate for using paragraph mode, so that `^` and `$` will match each line. Remove it and the code will work. If you do need it, you should split the paragraphs on newline before running the regex. — TLP, Sep 24 '20 at 16:11
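A minimal sketch of the second suggestion in this comment (splitting the paragraph on newlines before applying the regex). The sample paragraph is the one from the question, held in a string instead of being read from DATA; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Same sample paragraph as in the question, held in a string for brevity.
my $paragraph = join "\n", qw(aaaaBaBaBB bbbbBbadbe cccc dddd eeeeBeeeee ffff gggg);

my @out;
for my $line (split /\n/, $paragraph) {
    # Each $line holds exactly one line, so [^B]* has no newline to cross
    # and no /m is needed: ^ and $ are simply the ends of this short string.
    (my $copy = $line) =~ s/^([^B]*)(B.*?)$/>$1|$2</;
    push @out, $copy;
}

my $per_line_result = join "\n", @out;
print "$per_line_result\n";
```

Only the lines that contain a B are rewritten; cccc, dddd, ffff and gggg pass through unchanged.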

score 3 · Answer 1 · edited Sep 24 '20 at 16:12

[^B]* will match against as many non-B characters as possible, including newlines. Replacing it with [^B\n]* may do what you want.
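A small side-by-side sketch of the two character classes, run on the paragraph from the question. It is only meant to illustrate the point above; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $paragraph = "aaaaBaBaBB\nbbbbBbadbe\ncccc\ndddd\neeeeBeeeee\nffff\ngggg\n";

# Original pattern: [^B] also matches "\n", so group 1 can span several lines.
(my $original = $paragraph) =~ s/^([^B]*)(B.*?)$/>$1|$2</mg;

# Suggested pattern: excluding "\n" keeps group 1 inside a single line.
(my $fixed = $paragraph) =~ s/^([^B\n]*)(B.*?)$/>$1|$2</mg;

print "with [^B]*:\n$original\nwith [^B\\n]*:\n$fixed";
```

With [^B]* the third match starts at cccc and runs through eeeeBeeeee; with [^B\n]* the lines cccc and dddd are left alone.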

answered Sep 24 '20 at 13:08 by Dave Mitchell; edited Sep 24 '20 at 16:12 by TLP

Why would it match newlines, when the regex ends with $? perldoc says that with /m, $ matches end of line. I think that what you say would be true with /s. – Jacob Wegelin Sep 24 '20 at 13:24
Why would it *not* match newlines? The only exception made for newlines is for the dot metacharacter `.` which normally does not match newline, unless the `/s` modifier is used. `[^B]*` means *match any character except B 0 or more times*. The reason it matches over the newlines is that you slurped all the rows into one string. In line-by-line mode it would not be able to do that. – TLP Sep 24 '20 at 15:18
The `/m` allows $ to match the condition before the end of a line, or the end of the string. [perlre](https://perldoc.pl/perlre) says "$ Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)" – brian d foy Sep 24 '20 at 20:14
The elements of a regex are generally applied in a left-to-right order, until all have passed, or on failure, backtracking and trying another alternative. In your case, the first element [^B]* slurps as many non-B's as it can, then the 'B' element matches the B, then the .*? element slurps as few non-newline chars as possible, followed by the '$' element repeatedly failing to match an end of line and triggering a backtrack until a retry of the .*? slurps enough characters for the '$' to succeed. – Dave Mitchell Sep 25 '20 at 08:14
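To make the comments above concrete, here is a rough sketch that lists the offsets at which ^ and $ are able to match, with and without /m. The helper match_offsets and the three-line test string are invented for this illustration. It shows that /m only adds places where the anchors may match; it does not stop any other part of the pattern from consuming a newline.

```perl
use strict;
use warnings;

# Return every offset at which a zero-width pattern matches in $string.
sub match_offsets {
    my ($string, $regex) = @_;
    my @offsets;
    push @offsets, pos($string) while $string =~ /$regex/g;
    return @offsets;
}

my $text = "ab\ncd\nef";    # newlines sit at offsets 2 and 5; length is 8

my @dollar_plain = match_offsets($text, qr/$/);     # end of string only
my @dollar_m     = match_offsets($text, qr/$/m);    # before each newline, and at the end
my @caret_m      = match_offsets($text, qr/^/m);    # string start, and after each newline

print "\$ without /m: @dollar_plain\n";
print "\$ with /m:    @dollar_m\n";
print "^ with /m:    @caret_m\n";
```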

Håkon Hægland · Accepted Answer · 2020-09-24T14:53:29.430

According to perldoc perlretut:

When a regexp can match a string in several different ways, we can use the principles above to predict which way the regexp will match:

Principle 0: Taken as a whole, any regexp will be matched at the earliest possible position in the string.

Principle 1: In an alternation a|b|c... , the leftmost alternative that allows a match for the whole regexp will be the one used.

Principle 2: The maximal matching quantifiers '?' , '*' , '+' and {n,m} will in general match as much of the string as possible while still allowing the whole regexp to match.

Principle 3: If there are two or more elements in a regexp, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole regexp to match. The next leftmost greedy quantifier, if any, will try to match as much of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.

As we have seen above, Principle 0 overrides the others. The regexp will be matched as early as possible, with the other principles determining how the regexp matches at that earliest character position.
[...]
We can modify principle 3 above to take into account non-greedy quantifiers:

Principle 3: If there are two or more elements in a regexp, the leftmost greedy (non-greedy) quantifier, if any, will match as much (little) of the string as possible while still allowing the whole regexp to match. The next leftmost greedy (non-greedy) quantifier, if any, will try to match as much (little) of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.

So for this case

my $str = 'cccc
dddd
eeeeBeeeee
ffff
gggg';
$str =~ s/^([^B]*)(B.*?)$/>$1|$2</m;

we use principle 0 and principle 3 and hence it will match at the beginning position (position 0) in $str. According to principle 3, we start with the leftmost element:

^([^B]*)

It will match "as much of the string as possible while still allowing the whole regexp to match." This means it will match from the beginning of the string up to the first B, newlines included, because [^B] excludes only B. Then the engine considers the next element

(B.*?)$

Still, according to principle 3: It will match "as little of the string as possible while still allowing the whole regexp to match." So it will match from the B to the first newline found, which is the first place after the B where $ can succeed under /m.
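The walkthrough above can be checked with a short sketch that prints what each group captured and where the match ends. It uses the same $str and pattern; the other variable names are just for this example, and the built-in @+ array supplies the end offset.

```perl
use strict;
use warnings;

my $str = "cccc\ndddd\neeeeBeeeee\nffff\ngggg";

my ($group1, $group2, $match_end) = ('', '', -1);
if ($str =~ /^([^B]*)(B.*?)$/m) {
    # $+[0] is the offset just past the end of the whole match.
    ($group1, $group2, $match_end) = ($1, $2, $+[0]);
}

# Show newlines inside group 1 as a visible \n.
(my $shown1 = $group1) =~ s/\n/\\n/g;
print "group 1: '$shown1'\n";
print "group 2: '$group2'\n";
print "match ends at offset $match_end\n";
```

Group 1 starts at offset 0 and swallows two newlines on its way to the B; group 2 stops right before the newline at offset 20.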

Sometimes this is called "leftmost longest". – brian d foy Sep 24 '20 at 20:41

Timur Shtatland · Answer 3 · 2020-09-24T21:28:37.563

The Perl documentation for /m and /s modifiers and character classes could benefit from connecting the dots and adding a few more examples, which I will attempt here.

Regardless of the /m and /s modifiers, a bracketed character class is allowed to match a newline. That is why [^B]* matches \n and extends across multiple lines in your case. You can write a character class that explicitly contains a newline ([\n]) or explicitly excludes it ([^\n]). Besides the newline escape (\n), there is also \N, which matches any single character that is not a newline.

The /s modifier only alters the behavior of . (it allows . to match a newline). It does not alter the behavior of any other character classes.
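A quick sketch of these claims on a two-line string. The labels and variable names are invented for the example, and \N assumes Perl 5.12 or later.

```perl
use strict;
use warnings;

my $two_lines = "a\nb";    # one newline between "a" and "b"

# Each entry: a label, then whether the pattern matched (1) or not (0).
my @results = (
    [ '/a.b/'     => ( $two_lines =~ /a.b/     ? 1 : 0 ) ],  # "." refuses the newline
    [ '/a.b/s'    => ( $two_lines =~ /a.b/s    ? 1 : 0 ) ],  # /s lets "." accept it
    [ '/a[^x]b/'  => ( $two_lines =~ /a[^x]b/  ? 1 : 0 ) ],  # a negated class accepts it anyway
    [ '/a[^x]b/m' => ( $two_lines =~ /a[^x]b/m ? 1 : 0 ) ],  # /m makes no difference here
    [ '/a\Nb/s'   => ( $two_lines =~ /a\Nb/s   ? 1 : 0 ) ],  # \N refuses it even under /s
);

printf "%-10s %s\n", @$_ for @results;
```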

You can get markedly different behavior using the /m and /s modifiers alone, as shown in the examples below. This behavior is as documented, and hence predictable, but not always intuitive. I typically use these modifiers together (/ms), and find that it makes my code more intuitive and maintainable. This way, I do not have to think every time about the newline matching behavior. In fact, I typically use /xms modifiers in most regexes in my own code as a matter of habit, with /x allowing the code to be more readable and maintainable (Conway (2005), pp. 236-241, Vromans (2006)).
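As an illustration of this style, here is one possible /xms spelling of the substitution from the question. It is a sketch, not the only way to write it. It assumes the intent is to keep each match inside one line, so group 1 uses [^B\n]*, and the variable names are made up.

```perl
use strict;
use warnings;

my $paragraph = "aaaaBaBaBB\nbbbbBbadbe\ncccc\ndddd\neeeeBeeeee\nffff\ngggg\n";

(my $marked = $paragraph) =~ s{
    ^              # start of a line (because of /m)
    ( [^B\n]* )    # group 1: non-B characters, never crossing a newline
    ( B .*? )      # group 2: the first B, then as little as possible ...
    $              # ... up to the end of the same line (because of /m)
}{>$1|$2<}xmsg;

print $marked;
```

Under /s the dot could cross a newline, but the non-greedy .*? stops at the first position where $ succeeds, which under /m is the end of the current line.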

REFERENCES:

perlrecharclass - Perl Regular Expression Character Classes: Backslash sequences

\N Match a character that isn't a newline.

perlreref - Perl Regular Expressions Reference: CHARACTER CLASSES

\N A non newline (when not followed by '{NAME}';;
not valid in a character class; equivalent to [^\n]; it's
like '.' without /s modifier)

perlre - Perl regular expressions: Modifiers

m
Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.

s
Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.

(Note that it says nothing about /m or /s altering character classes other than '.', so we can infer from here that they are not altered)

Using /xms modifiers:

Always use the /x flag.

Always use the /m flag.

Always use the /s flag.

(Conway (2005), p. 236-241, Vromans (2006))

Damian Conway (2005) Perl Best Practices: Standards and Styles for Developing Maintainable Code. O'Reilly Media. https://www.amazon.com/Perl-Best-Practices-Developing-Maintainable/dp/0596001738/

Johan Vromans (2006) Perl Best Practices: Reference Guide. https://www.squirrel.nl/pub/PBP_refguide-1.02.00.pdf

EXAMPLES:

use strict;
use warnings;
use feature qw( say );

my @strings = (
    "abcd\n",        # single-line string
    "abcd\nabcd\n",  # multi-line string (first string repeated twice)
    "abXd\nabcd\n",  # multi-line string, same as above, but first 'c' replaced by 'X'
    "abcd\nabXd\n",  # multi-line string, like the second one, but second 'c' replaced by 'X'
);
my @regexes = ( '^([^c]*)(c.*?)$' );

foreach my $string ( @strings ) {
    foreach my $regex ( @regexes ) {
        my @matches;
        say "\n###";
        say "# \$string='$string'; \$regex='$regex'";
        
        @matches = map { "'$_'" } $string =~ /$regex/;
        say "regex_modifiers='';   \@matches=@matches;";
        
        @matches = map { "'$_'" } $string =~ /$regex/m;
        say "regex_modifiers='m';  \@matches=@matches;";
        
        @matches = map { "'$_'" } $string =~ /$regex/s;
        say "regex_modifiers='s';  \@matches=@matches;";
        
        @matches = map { "'$_'" } $string =~ /$regex/ms;
        say "regex_modifiers='ms'; \@matches=@matches;";
        
    }
}

Output:

###
# $string='abcd
'; $regex='^([^c]*)(c.*?)$'
regex_modifiers='';   @matches='ab' 'cd';   # ok
regex_modifiers='m';  @matches='ab' 'cd';   # /m, /s modifiers do not matter in single-line string
regex_modifiers='s';  @matches='ab' 'cd';   # /m, /s modifiers do not matter in single-line string
regex_modifiers='ms'; @matches='ab' 'cd';   # /m, /s modifiers do not matter in single-line string

###
# $string='abcd
abcd
'; $regex='^([^c]*)(c.*?)$'
regex_modifiers='';   @matches=;            # '.' does not match newline, cannot reach end of string
regex_modifiers='m';  @matches='ab' 'cd';   # '$' matches before the first newline
regex_modifiers='s';  @matches='ab' 'cd
abcd';                                      # '.' matches newline, so the end of string is reached
                                            # and '$' matches it.
regex_modifiers='ms'; @matches='ab' 'cd';   # non-greedy '.*?' causes '$' to match the first newline

###
# $string='abXd
abcd
'; $regex='^([^c]*)(c.*?)$'
regex_modifiers='';   @matches='abXd
ab' 'cd';                                    # [^c] matches newline, /m, /s modifiers do not matter 
regex_modifiers='m';  @matches='abXd
ab' 'cd';                                    # [^c] matches newline, /m, /s modifiers do not matter
regex_modifiers='s';  @matches='abXd
ab' 'cd';                                    # [^c] matches newline, /m, /s modifiers do not matter
regex_modifiers='ms'; @matches='abXd
ab' 'cd';                                    # [^c] matches newline, /m, /s modifiers do not matter

###
# $string='abcd
abXd
'; $regex='^([^c]*)(c.*?)$'
regex_modifiers='';   @matches=;             # '.' does not match newline, cannot reach end of string
regex_modifiers='m';  @matches='ab' 'cd';    # '$' matches before the first newline (first line is matched)
regex_modifiers='s';  @matches='ab' 'cd
abXd';                                       # '.' matches newline, so the end of string is reached
                                             # and '$' matches it.
regex_modifiers='ms'; @matches='ab' 'cd';    # non-greedy '.*?' causes '$' to match the first newline

Why would you use `/xms` modifiers as a matter of habit? That seems rather a strange choice. — TLP, Sep 24 '20 at 20:44
@TLP Thank you for the comment! I also found this practice strange when I encountered it first, but tried it and liked it. I clarified the source and added the citation and references to the book "Perl Best Practices" by Damian Conway. I realize that it is not a common practice (certainly does not appear so on SO), but rather a matter of style and personal preference. :) — Timur Shtatland, Sep 24 '20 at 21:33
