The script below works, but it requires a kludge. By "kludge" I mean a line of code that makes the script do what I want, although I do not understand why the line is necessary. Evidently, I do not understand exactly what the multiline regex substitution, ending in /mg, is doing.
Is there not a more elegant way to accomplish the task?
The script reads through a file by paragraphs. It partitions each paragraph into two subsets: $text and $cmnt. The $text includes the left part of every line, i.e., from the first column up to the first %, if one exists, or to the end of the line if it doesn't. The $cmnt includes the rest.
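The partition described above can also be sketched line by line, without a multiline substitution. This is only a minimal sketch of one possible alternative, not the script in question: the helper name partition_record and the sample record are hypothetical, and it assumes a comment-only line should leave an empty line behind in $text.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Hypothetical helper: split one record into ($text, $cmnt), line by line.
sub partition_record {
    my ($record, $breaker) = @_;
    my ($text, $cmnt) = ('', '');
    for my $line (split /\n/, $record) {
        # Limit of 2: split at the first breaker only.
        my ($left, $right) = split /\Q$breaker\E/, $line, 2;
        $text .= (defined $left ? $left : '') . "\n";
        # Only lines that contain a breaker contribute to $cmnt.
        $cmnt .= "$breaker$right\n" if defined $right;
    }
    return ($text, $cmnt);
}

# Hypothetical sample record, loosely based on the example file.
my ($text, $cmnt) = partition_record(
    "dog wolf % flea\n% what was that?\ncat lion\n", '%');
print "text==|$text|\n";
print "cmnt==|$cmnt|\n";
```

Because every line is handled separately, a line without a breaker can never end up in $cmnt, so no sentinel character has to be appended to the record.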
Motivation: The files to be read are LaTeX markup, where % announces the beginning of a comment. We could change the value of $breaker to # if we were reading through a Perl script. After separating $text from $cmnt, one could perform a match across lines such as
print "match" if ($text =~ /WOLF\s*DOG/s);
Please see the line labeled "kludge."
Without that line, something funny happens after the last % in a record. If there are lines of $text (material not commented out by %) after the last commented line of the record, those lines are included both at the end of $cmnt and in $text.
In the example below, this means that without the kludge, in record 2, "cat lion" is included both in the $text, where it belongs, and also in $cmnt.
(The kludge causes an unnecessary % to appear at the end of every non-void $cmnt. This is because the kludge-pasted-on % announces a final, fictitious empty comment line.)
According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means
Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
Therefore, I would expect the second capture group in
s/^([^$breaker]*)($breaker.*?)$/$2/mg
to start with the first %, to extend as far as end-of-line, and stop there. So even without the kludge, $cmnt should not include the "cat lion" in record 2? But obviously it does, so I am misreading, or missing, some part of the documentation. I suspect it has to do with the /g regex modifier?
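A small experiment may help isolate the behavior. The sketch below is not part of the script; the two sample strings are hypothetical, modeled on record 2 of the example file. It checks two things: whether s///g leaves text in place when no match covers it, and whether the negated class [^%] can match a newline.

```perl
#!/usr/bin/perl
use strict; use warnings;

my $breaker = '%';
# Hypothetical record, modeled on record 2 of the example file.
my $record = "% what was that?\ncat lion\n";

# 1. The same substitution as in the script, without the kludge. The last
#    line contains no breaker, so no match covers it, and s///g leaves
#    unmatched text where it was.
(my $cmnt = $record) =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg;
print "unmatched tail kept: ", ($cmnt =~ /cat lion/ ? "yes" : "no"), "\n";

# 2. A negated class such as [^%] also matches "\n", so group 1 can
#    span several lines before it reaches a breaker.
my $two_lines = "cat lion\ndog % flea\n";
my ($left) = $two_lines =~ /^([^$breaker]*)$breaker/m;
print "group 1 spans lines: ", ($left =~ /\n/ ? "yes" : "no"), "\n";
```

Both checks print "yes". This suggests that the kludge works because the appended % gives the trailing lines a breaker to match against: they are then swallowed by group 1 of a final match and dropped from $cmnt. If that reading is right, excluding the newline from the class, as in [^$breaker\n]*, would be one way to keep each match on a single line, although lines without a breaker would still need separate handling.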
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++;
    my $text = $_;
    my $cmnt;
    s/[\n]*\z/$breaker/; # kludge
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy
    {
        $cmnt = $_;
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy
    }
    else
    {
        $cmnt = '';
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}
Example file to run it on:
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA
% what was that?
cat lion
no comments in this line
%The last paragraph of this file is nothing but a single-line comment.