I recently started learning Perl to automate some mindless data tasks. I work on Windows machines but prefer to use Cygwin. I wrote a Perl script that did everything I wanted in Cygwin, but when I tried to run it with Strawberry Perl on Windows via CMD, I got the "Unescaped left brace in regex is illegal here in regex" error.

After some reading, I am guessing my Cygwin has an earlier version of Perl, and the more modern version that Strawberry uses doesn't allow this. I am familiar with escaping characters in a regex, but I am getting this error when using a capture group from a previous regex match to do a substitution.

open(my $fh, '<:encoding(UTF-8)', $file)
    or die "Could not open file '$file' $!";
my $fileContents = do { local $/; <$fh> };

my $i = 0;
while ($fileContents =~ /(.*Part[^\}]*\})/) {
    $defParts[$i] = $1;
    $i = $i + 1;
    $fileContents =~ s/$1//;
}

Basically I am searching through a file for matches that look like:

Part
{
    Somedata
}

I then store those matches in an array and purge each match from $fileContents so I avoid repeats.

I am certain there are better and more efficient ways of doing any number of these things, but I am surprised that when using a capture group it's complaining about unescaped characters.

I can imagine storing the capture group, manually escaping the braces, then using that for the substitution, but is there a quicker or more efficient way to avoid this error without rewriting the whole block? (I'd like to avoid special packages if possible so that this script is easily portable.)

All of the answers I found related to this error were with specific cases where it was more straightforward or practical to edit the source with the curly braces.

Thank you!

ikegami
Asterdahl

2 Answers

As for the question of escaping, that's what quotemeta is for,

use feature 'say';    # say is not enabled by default
my $needs_escaping = q(some { data } here);
say quotemeta $needs_escaping;

which prints (on v5.16)

some\ \{\ data\ \}\ here

and works on $1 as well. See linked docs for details. Also see \Q in perlre (search for \Q), which is how this is used inside a regex, say s/\Q$1//;. The \E stops the escaping (which you don't need here, since nothing follows $1 in the pattern).
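As a minimal sketch of the \Q approach applied to the question's loop body: the sample text below is hypothetical, and the pattern is copied from the question. The captured text is copied into a lexical ($match, a name chosen here for illustration) before it is interpolated.

```perl
use strict;
use warnings;

# Hypothetical sample text; the block layout follows the question.
my $text = "header\nPart\n{\n    Somedata\n}\nfooter\n";

my $removed;
if ($text =~ /(.*Part[^\}]*\})/) {
    my $match = $1;             # copy the capture before running another regex
    $removed = $match;
    $text =~ s/\Q$match\E//;    # \Q...\E quotes every metacharacter in $match
}

print "Removed:\n$removed\n";
print "Left over:\n$text";
```

The braces inside $match are matched literally, so the pattern no longer contains an unescaped left brace.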

Some comments.

Relying on deletion so that the regex keeps finding further such patterns may be a risky design. If it isn't, and you do use it, there is no need for indices, since we have push

my @defParts;
while ($fileContents =~ /($pattern)/) {
    push @defParts, $1;
    $fileContents =~ s/\Q$1//;
}

where \Q is added in the regex. Better yet, as explained in melpomene's answer, the substitution can be done in the while condition itself

push @defParts, $1  while $fileContents =~ s/($pattern)//;

where I used the statement modifier form (postfix syntax) for conciseness.

With the /g modifier in scalar context, as in while (/($pattern)/g) { .. }, the search continues from the position of the previous match in each iteration, and this is a usual way to iterate over all instances of a pattern in a string. Please read up on use of /g in scalar context as there are details in its behavior that one should be aware of.

However, this is tricky here (even as it works) as the string changes underneath the regex. If efficiency is not a concern, you can capture all matches with /g in list context and then remove them

my @all_matches = $fileContents =~ /$pattern/g;
$fileContents =~ s/$pattern//g;

While inefficient, as it makes two passes, this is much simpler and clearer.

I expect that Somedata cannot possibly, ever, contain }, for instance as nested { ... }, correct? If it does, you have a problem of balanced delimiters, which is far more involved. One approach is to use the core Text::Balanced module. Search for SO posts with examples.
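For the nested case, a rough sketch with Text::Balanced's extract_bracketed might look like the following. The input string is made up for illustration, and the text is trimmed to start at the first { before the call, since extract_bracketed expects the bracket at the start of its input (after optional whitespace).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input with a nested pair of braces inside the Part block.
my $text = "Part\n{\n    Outer { inner } data\n}\ntrailing text\n";

# Skip ahead to the first opening brace.
my $start = index($text, '{');
die "no opening brace found" if $start < 0;
my $from_brace = substr($text, $start);

# In list context the first two return values are the balanced chunk
# and whatever follows it.
my ($block, $remainder) = extract_bracketed($from_brace, '{}');
die "no balanced block found" unless defined $block && length $block;

print "Block:\n$block\n";
print "Remainder:$remainder";
```

Unlike [^\}]*\}, this keeps going past the inner } and stops at the brace that closes the outer block.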

zdim
  • Thanks for the detailed response! It seems based on melpomene's response that I should (as I expected) just avoid substituting the capture group entirely, but it's good to know about quotemeta if I ever actually do need to do a similar operation, thank you! Also, you are correct, Somedata cannot possibly ever contain "}," but thanks for pointing out balanced delimiters. – Asterdahl May 08 '18 at 17:44
  • @Asterdahl I answered _before_ melpomene, see the time stamps (with small edits later). Regardless, theirs is a good answer. Thanks for a comment :) – zdim May 08 '18 at 19:20
  • Ah, I assumed yours came second but it seems you edited it to reference melpomene's answer. I was torn between which to rate as the answer, because yours directly answered how to deal with the escaped characters but melpomene's is probably the best way to change the code. Both great answers and both got my upvote! Thanks again. – Asterdahl May 08 '18 at 19:55
  • @Asterdahl I added that one bit, for completeness of my post even as it had already been shown in another answer. For that same reason I felt obliged to quote it. The answers mostly overlap, which results in having different wording and choices of what is said for the same techniques. So altogether this should be a useful page. Thank you for your behavior, with choices and votes and comments :) – zdim May 08 '18 at 20:19

I would just bypass the whole problem and at the same time simplify the code:

my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
    $defParts[$i] = $1;
    $i = $i + 1;
}

Here we simply do the substitution first. If it succeeds, it will still set $1 and return true (just like plain /.../), so there's no need to mess around with s/$1// later.

Using $1 (or any variable) as the pattern would mean you have to escape all regex metacharacters (e.g. *, +, {, (, |, etc.) if you want it to match literally. You can do that pretty easily with quotemeta or inline (s/\Q$1//), but it's still an extra step and thus error prone.

Alternatively, you could keep your original code and not use s///. I mean, you already found the match. Why use s/// to search for it again?

while ($fileContents =~ /(.*Part[^\}]*\})/) {
    ...
    substr($fileContents, $-[0], $+[0] - $-[0], "");
}

We already know where the match is in the string. $-[0] is the position of the start and $+[0] the position of the end of the last regex match (thus $+[0] - $-[0] is the length of the matched string). We can then use substr to replace that chunk by "".
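A small self-contained version of that loop, with a made-up input string, could look like this. The names $start, $end and @found are chosen here for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample text with one Part block.
my $text = "keep1\nPart\n{\n    Somedata\n}\nkeep2\n";
my @found;

while ($text =~ /(.*Part[^\}]*\})/) {
    push @found, $1;

    # $-[0] and $+[0] are the start and end offsets of the whole match.
    my ($start, $end) = ($-[0], $+[0]);

    # Four-argument substr replaces that range with the empty string.
    substr($text, $start, $end - $start, "");
}

print "Found ", scalar(@found), " block(s)\n";
print "Left over:\n$text";
```

Because the matched text is never interpolated back into a pattern, no escaping is involved at all.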

But let's keep going with s///:

my $i = 0;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
    $defParts[$i] = $1;
    $i++;
}

$i = $i + 1; can be reduced to $i++; ("increment $i").

my @defParts;
while ($fileContents =~ s/(.*Part[^\}]*\})//) {
    push @defParts, $1;
}

The only reason we need $i is to add elements to the @defParts array. We can do that by using push, so there's no need for maintaining an extra variable. This saves us another line.

Now we probably don't need to destroy $fileContents. If the substitution exists only for the benefit of this loop (so it doesn't re-match already extracted content), we can do better:

my @defParts;
while ($fileContents =~ /(.*Part[^\}]*\})/g) {
    push @defParts, $1;
}

Using /g in scalar context attaches a "current position" to $fileContents, so the next match attempt starts where the previous match left off. This is probably more efficient because it doesn't have to keep rewriting $fileContents.
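The "current position" can be inspected with pos. The following sketch uses a made-up string and a simple \d+ pattern (not the question's pattern) just to show how the position advances and what happens once the match finally fails.

```perl
use strict;
use warnings;

my $text = "a1 b22 c333";   # hypothetical input
my @positions;

while ($text =~ /(\d+)/g) {
    # pos() reports the offset just past the current match.
    push @positions, pos($text);
}

# Without /c, a failed match resets the position, so pos() is undef here.
my $after_loop = defined pos($text) ? pos($text) : 'undef';

print "@positions / $after_loop\n";   # prints "2 6 11 / undef"
```

The reset after a failed match is one of the details to keep in mind: a second while loop over the same string starts again from the beginning.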

my @defParts = $fileContents =~ /(.*Part[^\}]*\})/g;

... Or we could just use //g in list context, where it returns a list of all captured groups of all matches, and assign that to @defParts.

my @defParts = $fileContents =~ /.*Part[^\}]*\}/g;

If there are no capture groups in the regex, //g in list context returns the list of all matched strings (as if there had been ( ) around the whole regex).
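To see the difference between the two list-context forms side by side, here is a short sketch on a made-up string. The third case, with two capture groups, is an addition for illustration: the captures of all matches come back as one flat list.

```perl
use strict;
use warnings;

my $text = "x=1, y=22";   # hypothetical input

# No capture groups: one element per match, holding the whole matched string.
my @whole = $text =~ /\w=\d+/g;

# One capture group: one element per match, holding only the captured part.
my @values = $text =~ /\w=(\d+)/g;

# Two capture groups: two elements per match, flattened into a single list.
my @pairs = $text =~ /(\w)=(\d+)/g;

print join("|", @whole),  "\n";   # prints "x=1|y=22"
print join("|", @values), "\n";   # prints "1|22"
print join("|", @pairs),  "\n";   # prints "x|1|y|22"
```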

Feel free to choose any of these. :-)

melpomene
  • Thanks for the fantastic response! I ended up choosing the final while loop with /g solution you suggested. Much more straightforward than my original solution. I played around with the final two solutions, but couldn't get them to work. In both cases when I print the array to shell after setting them, I got the same result as the while loop solution, however, when I later print them into a different, new file (which is the ultimate goal) it simply refused to print anything which baffled me. The other solution works great though and I am happily using it now. – Asterdahl May 08 '18 at 18:03