perl multiline regex to separate comments within paragraphs

Question

The script below works, but it requires a kludge. By "kludge" I mean a line of code which makes the script do what I want --- but I do not understand why the line is necessary. Evidently, I do not understand exactly what the multiline regex substitution, ending /mg, is doing.

Is there not a more elegant way to accomplish the task?

The script reads through a file by paragraphs. It partitions each paragraph into two subsets: $text and $cmnt. The $text includes the left part of every line, i.e., from the first column up to the first %, if it exists, or to end of the line if it doesn't. The $cmnt includes the rest.

Motivation: The files to be read are LaTeX markup, where % announces the beginning of a comment. We could change the value of $breaker to equal # if we were reading through a perl script. After separating $text from $cmnt, one could perform a match across lines such as

print "match" if ($text =~ /WOLF\s*DOG/s);
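
As a small check of the search above, here is a sketch with made-up sample data. Because `\s` already matches a newline, the pattern should find WOLF and DOG on adjacent lines with or without `/s`, since `/s` only changes what `.` matches. The helper name `spans_lines` is hypothetical.

```perl
use strict; use warnings;

# Hypothetical helper: report whether WOLF is followed by DOG, possibly
# on the next line, both with and without the /s modifier.
sub spans_lines {
    my ($text) = @_;
    my $with_s    = ($text =~ /WOLF\s*DOG/s) ? 1 : 0;
    my $without_s = ($text =~ /WOLF\s*DOG/)  ? 1 : 0;
    return ($with_s, $without_s);
}

# Sample $text as it might look once the comments have been stripped.
my ($with_s, $without_s) = spans_lines("dog wolf \nDOG WOLF \nDOG WOLLLLLLF \n");
print "with /s: $with_s, without /s: $without_s\n";
```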

Please see the line labeled "kludge." Without that line, something funny happens after the last % in a record. If there are lines of $text (material not commented out by %) after the last commented line of the record, those lines are included both at the end of $cmnt and in $text.

In the example below, this means that without the kludge, in record 2, "cat lion" is included both in the $text, where it belongs, and also in $cmnt.

(The kludge causes an unnecessary % to appear at the end of every non-void $cmnt. This is because the kludge-pasted-on % announces a final, fictitious empty comment line.)

According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means

Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.

Therefore, I would expect the 2nd capture group ($2) in

s/^([^$breaker]*)($breaker.*?)$/$2/mg

to start with the first %, to extend as far as the end of the line, and to stop there. So even without the kludge, it should not include the "cat lion" in record 2? But obviously it does, so I am misreading, or missing, some part of the documentation. I suspect it has to do with the /g regex modifier?

#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++; 
    my $text = $_; 
    my $cmnt;
    s/[\n]*\z/$breaker/; # kludge
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)  # non-greedy
    {
        $cmnt    = $_; 
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);  # non-greedy
    }
    else
    {
        $cmnt    = ''; 
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}

Example file to run it on:

dog wolf % flea 
DOG WOLF % FLEA 
DOG WOLLLLLLF % FLLLLLLEA 


% what was that?
 cat lion


no comments in this line




%The last paragraph of this file is nothing but a single-line comment.

Håkon Hægland · Accepted Answer · 2020-09-22T20:32:36.323

1

You must also delete the lines that do not contain a comment from $cmnt:

use feature qw(say);
use strict;
use warnings;

my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++;
    my $text = $_;
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)  # non-greedy
    {
        $cmnt    = $_;
        $cmnt =~ s/^[^$breaker]*?$//mg;
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);  # non-greedy
    }
    else
    {
        $cmnt    = '';
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}

Output:

RECORD 1:
******** text==
|dog wolf 
DOG WOLF 
DOG WOLLLLLLF 

|
******** cmnt==|% flea 
% FLEA 
% FLLLLLLEA 
|

RECORD 2:
******** text==
|
 cat lion

|
******** cmnt==|% what was that?

|

RECORD 3:
******** text==
|no comments in this line

|
******** cmnt==||

RECORD 4:
******** text==
||
******** cmnt==|%The last paragraph of this file is nothing but a single-line comment.
|
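
To see why the extra deletion is needed, here is a minimal sketch on record 2 alone. The substitution with /mg rewrites only the lines that match, so a line with no breaker passes through untouched unless it is emptied first. The helper `comments_only` and its `$drop_plain` flag are hypothetical. The character class `[^...\n]` is an assumption that keeps each match on one line, which differs slightly from the `[^$breaker]` used in the script.

```perl
use strict; use warnings;

# Hypothetical helper: reduce a record to its comments.
# $drop_plain false: only the "keep the comment part" substitution.
# $drop_plain true:  first empty every line that has no breaker.
sub comments_only {
    my ($record, $breaker, $drop_plain) = @_;
    my $q = quotemeta $breaker;
    # Assumption: excluding \n from the class keeps each match on one line.
    $record =~ s/^[^${q}\n]*$//mg if $drop_plain;
    # Lines without a breaker do not match here and are left as they are.
    $record =~ s/^[^${q}\n]*(${q}.*)$/$1/mg;
    return $record;
}

my $record = "% what was that?\n cat lion\n";
print "without cleanup:\n|", comments_only($record, '%', 0), "|\n";
print "with cleanup:\n|",    comments_only($record, '%', 1), "|\n";
```

In the first call " cat lion" survives in the result; in the second it is replaced by an empty line, as in the output for RECORD 2 above.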

edited Sep 22 '20 at 20:32

answered Sep 22 '20 at 19:53


What is `/x` doing? https://perldoc.perl.org/perlre.html#%2fx-and-%2fxx says "Extend your pattern's legibility by permitting whitespace and comments," but this /x does more than that, correct? – Jacob Wegelin Sep 22 '20 at 20:18
I only used `/x` as a trick to avoid having `$.` interpreted as the special variable for the current line number. – Håkon Hægland Sep 22 '20 at 20:20
Your substitution ignores or zeroes out everything from the match to the end of record, I think. How does this work on a multiline record without `/g`? I tried this code with `$str2="elephants, %really??\nbadgers % more likely %Joe said no.\n% what was that?\ncat lion";`, that is, with 4 lines, the last one without any comment, and the outcome was only the 1st comment: `2: $2 = '%really??', $str = '%really??'`. – Jacob Wegelin Sep 22 '20 at 20:28
Yes, you are right. So in the end you want `$cmnt` to contain all the comments (on all lines)? Is that correct? See my updated answer. – Håkon Hægland Sep 22 '20 at 20:29
Yes. I want to iterate through a LaTeX file paragraph by paragraph and be able to match for a pattern, possibly across lines, either inside `$cmnt` or inside `$text`. It's entirely possible that `print "match" if ( $cmnt=~ /Bob\s*said\s*no/s );` could be a useful search (with "Bob said no" replaced by some other strings). – Jacob Wegelin Sep 22 '20 at 20:37
This solution is elegant and simple. Your single line deletes the irrelevant material, leaving only the comment. And I think that my `die "cmnt does not match"` line can be removed from the code as well. – Jacob Wegelin Sep 22 '20 at 20:46

score 1 · Answer 2 · answered Sep 23 '20 at 18:33

My main source of confusion was a failure to distinguish between

whether or not an entire record matches -- here, a record is potentially a multi-line paragraph, and
whether or not a line inside a record matches.
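
The distinction can be made visible with the return value of the substitution. What follows is only a sketch, and the helper `match_counts` is hypothetical. With /g, `s///` returns the number of substitutions it made, so the `if` test is true as soon as one line matches, however many other lines do not.

```perl
use strict; use warnings;

# Hypothetical helper: return (lines rewritten, lines in the record).
sub match_counts {
    my ($record, $breaker) = @_;
    my $q = quotemeta $breaker;
    my $total = ($record =~ tr/\n//);   # one newline per line
    # s///g returns the number of substitutions, or an empty string if none.
    my $rewritten = ($record =~ s/^([^${q}\n]*)(${q}.*)$/$2/mg) || 0;
    return ($rewritten, $total);
}

my ($rewritten, $total) = match_counts(
    "two %two comment\nthuh-ree\nfower\nfi-yiv % (he means 5)\n", '%');
print "$rewritten of $total lines contained a comment\n";
```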

The following script incorporates insights from both answers that others offered, and includes extensive explanation.

#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';

$/ = ''; # one paragraph at a time
while(<DATA>)
{
    $count_record++; 
    my $text = $_; 
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    print "RECORD $count_record:";
    print "\n|"; print $_; print "|\n";
    # https://perldoc.perl.org/perlre.html#Modifiers
    # the following regex:
    # ^                     /m: ^==start of line, not of record
    # ([^$breaker]*)        zero or more characters that are not $breaker
    # ($breaker.*?)         non-greedy: the first instance of $breaker, followed by everything after $breaker
    # $                     /m: $==end   of line, not of record
    #                       /g: "globally match the pattern repeatedly in the string"
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)
    {
        $cmnt    = $_; 
        # In at least one line of this record, the pattern above has matched.
        # But this does not mean every line matches. There may be any number of
        # lines inside the record that do not match /$breaker/; for these lines,
        # in spite of /g, there will be no match, and thus the exclusion of $1 and printing only of $2,
        # in the substitution below, will not take place. Thus, those particular lines must be deleted from $cmnt. 
        # Thus:
        $cmnt =~ s/^[^$breaker]*?$/\n/mg; # blank out any line that does not contain $breaker (its content is replaced by a newline)
        # recall that /m guarantees that ^ and $ match the start and end of the line, not of the record.
        die "code error: cmnt does not match this record" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);
        if ( $text =~ /\S/ )
        {
            print "|text|==\n|$text|\n";
        }
        else
        {
            print "NO text found\n";
        }
        print "|cmnt|==\n|$cmnt|\n";
    }
    else
    {
        print "NO comment found\n";
    }
}

__DATA__
one dogs% one comment %d**n lies %statistics
two %two comment
thuh-ree
fower
fi-yiv % (he means 5)
SIX 66 % ¿666==antichrist?
seven % the seventh seal, the seven days
ate
niner
ten

As Douglass said to Lincoln ... 

%Darryl Pinckney
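
As a possible alternative to the multiline substitutions, each line can be split at its first breaker with `index` and `substr`. This is a sketch under assumptions and does not appear in any of the answers. The helper `split_record` is hypothetical, and like the scripts above it treats every breaker as the start of a comment, including an escaped `\%` in LaTeX.

```perl
use strict; use warnings;

# Hypothetical helper: split one paragraph into its text part and its
# comment part, working line by line instead of with s///mg.
sub split_record {
    my ($record, $breaker) = @_;
    my (@text, @cmnt);
    for my $line (split /\n/, $record) {
        my $pos = index($line, $breaker);
        if ($pos < 0) {                          # no comment on this line
            push @text, $line;
        } else {
            push @text, substr($line, 0, $pos);  # up to the first breaker
            push @cmnt, substr($line, $pos);     # breaker and the rest
        }
    }
    return (join("\n", @text), join("\n", @cmnt));
}

my ($text, $cmnt) = split_record("% what was that?\n cat lion\n", '%');
print "text==|$text|\ncmnt==|$cmnt|\n";
```

Because lines without a breaker never reach `@cmnt`, no separate cleanup step is needed in this variant.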

score -1 · Answer 3 · answered Sep 22 '20 at 19:37

The regex modifiers /m and /g are meant for a string that contains multiple lines (that is, \n inside the string). /m lets ^ and $ match at the start and end of every line, and /g repeats the match through the whole string.

Please study the following code, which should simplify the solution to your problem.

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $breaker = '%';
my @records = do { local $/ = ''; <DATA> };

for( @records ) {
    my %hash = ( /(.*?)$breaker(.*)/mg );
    next unless %hash;
    say Dumper(\%hash);
}

__DATA__
dog wolf % flea 
DOG WOLF % FLEA 
DOG WOLLLLLLF % FLLLLLLEA 


% what was that?
 cat lion


no comments in this line




%The last paragraph of this file is nothing but a single-line comment.

Output

$VAR1 = {
          'DOG WOLF ' => ' FLEA ',
          'dog wolf ' => ' flea ',
          'DOG WOLLLLLLF ' => ' FLLLLLLEA '
        };

$VAR1 = {
          '' => ' what was that?'
        };

$VAR1 = {
          '' => 'The last paragraph of this file is nothing but a single-line comment.'
        };
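
One caveat with collecting the captures into a hash: two lines with the same text before the breaker, for example two comment-only lines, share a key, so only one of them survives. A sketch that keeps every line as a pair in an array instead; the helper `comment_pairs` is hypothetical.

```perl
use strict; use warnings;

# Hypothetical helper: return one [text, comment] pair per commented line.
sub comment_pairs {
    my ($record, $breaker) = @_;
    my $q = quotemeta $breaker;
    my @pairs;
    while ($record =~ /^([^\n]*?)${q}([^\n]*)$/mg) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my @pairs = comment_pairs("% one\n% two\n", '%');
my %hash  = map { ($_->[0], $_->[1]) } @pairs;   # same data squeezed into a hash
printf "%d pairs, %d hash keys\n", scalar @pairs, scalar keys %hash;
```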

According to perldoc perlre, regex modifier `/m` means that `$` matches the end of a line. And I think that `/mg` makes it happen repeatedly, for all end-of-lines in the record. – Jacob Wegelin Sep 22 '20 at 21:02
