4

I have text files containing lines like:

2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71

I am matching them with /^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/ But I also have some files with lines in a completely different format, which I match with a different regex. When I open a file I determine which format and assign $pat = '<regex-string>'; in a switch/case block:

$pat = '/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/'

But the ? character that introduces the non-capturing group I use to match repeats after the date and before the first currency amount causes the Perl interpreter to fail to compile the script, reporting on abort:

syntax error at ./report-dates-amounts line 28, near "}continue "

If I delete the ? character, or replace ? with \? escaped character, or first assign $q = '?' then replace ? with $q inside a " string assignment (ie. $pat = "/^\s*(\S+)\s+($q:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/"; ) the script compiles and runs. If I assign the regex string outside the switch/case block that also works OK. Perl v5.26.1 .

My code also doesn't have any }continue in it, which as reported in the compilation failure is probably some kind of transformation of the switch/case code by Switch.pm into something native the compiler chokes on. Is this some kind of bug in Switch.pm? It fails even when I use given/when in exactly the same way.

#!/usr/local/bin/perl

use Switch;

# Edited for demo
switch($format)
{
    # Format A eg:
    #     2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
    #     3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
    #     3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
    #     4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
    #
    case /^(?:april|snow)$/i
    { # This is where the ? character breaks compilation:
        $pat = '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$';

      # WORKS:
      # $pat = '^\s*(\S+)\s+(' .$q. ':[0-9|\)| ]+)+\s+\D' .$q. '(\S+)\s+\$';
    }

    # Format B
    case /^(?:umberto|petro)$/i
    {
        $pat = '^(\S+)\s+.*Think 1\s+(\S+)\s+';
    }
}
Matthew
  • 757
  • 11
  • 19
  • On a little further testing it appears that a ? character appearing anywhere inside a case block, even in a # comment, causes compilation to fail. Ie. this script too: #!/usr/local/bin/perl use Switch; # Edited for demo switch($format) { case /^(?:april|snow)$/i { # ? causes compilation to fail. $x = 1; } # Format B case /^(?:umberto|petro)$/i { # This comment isn't a problem. $y = 1; } } – Matthew Apr 14 '19 at 19:31
  • 2
    [Switch](http://p3rl.org/Switch) uses a source filter, which is not recommended for any serious work. Don't use it. – choroba Apr 14 '19 at 19:42
  • It fails with given/when instead of switch/case the same way. – Matthew Apr 14 '19 at 20:02
  • `when` requires the regex in parentheses when it's not used as a statement modifier. – choroba Apr 14 '19 at 20:13
  • How do I parenthesize the regex in the `when` block? I tried `when (/^(?:april|snow)$/i)` but that failed with a syntax error. – Matthew Apr 14 '19 at 20:19
  • Did you `use feature 'switch'` or `use v5.10`+ ? You need one of those to use `given` which was introduced in Perl v5.10. – Dada Apr 14 '19 at 20:21
  • With `when (/.../)` it works for me. – choroba Apr 14 '19 at 20:23
  • I replaced `use Switch;` with `use feature 'switch';` but compilation failed with `given is experimental` and `when is experimental`, as it did with `use feature ':5.10';` and with `use v5.10;` instead. – Matthew Apr 14 '19 at 20:25
  • 1
    `given is experimental`/`when is experimental` are just warnings, not errors. If you'd want to silence them, `no warnings 'experimental::smartmatch'` (but really, consider my answer before using `switch` or `given`). – Dada Apr 14 '19 at 20:27
  • I rewrote the `switch/case` to use `for/if/elsif/else` as per Dada's answer below. It's astounding that in its 4th decade of development Perl still doesn't implement a straightforward switch/case statement block. The design is too ambitious at the expense of basic functionality. – Matthew Apr 14 '19 at 20:34
  • 1
    I completely agree with your last statement @Matthew. I guess `given/when` was an attempt to remediate to that, but unfortunately ended up failing because of smartmatch, leaving us without a simple builtin `switch/case`.. – Dada Apr 14 '19 at 20:44

2 Answers2

4

Don't use Switch. As mentionned by @choroba in the comments, Switch uses a source filter, which leads to mysterious and hard to debug errors, as you constated.

The module's documentation itself says:

In general, use given/when instead. It were introduced in perl 5.10.0. Perl 5.10.0 was released in 2007.

However, given/when is not necessarily a good option as it is experimental and likely to change in the future (it seems that this feature was almost removed from Perl v5.28; so you definitely don't want to start using it now if you can avoid it). A good alternative is to use for:

for ($format) {
    if (/^(?:april|snow)$/i) {
       ...
    } 
    elsif (/^(?:umberto|petro)$/i) {
       ...
    }
}

It might look weird a first, but once you get used to it, it's actually reasonable in my opinion. Or, of course, you can use none of this options and just do:

sub pattern_from_format {
    my $format = shift;

    if ($format =~ /^(?:april|snow)$/i) {
       return qr/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/;
    } 
    elsif ($format =~ /^(?:umberto|petro)$/i) {
        return qr/^(\S+)\s+.*Think 1\s+(\S+)\s+/;
    }
    # Some error handling here maybe
 }

If, for some reason, you still want to use Switch: use m/.../ instead of /.../.

I have no idea why this bug is happening, however, the documentation says:

Also, the presence of regexes specified with raw ?...? delimiters may cause mysterious errors. The workaround is to use m?...? instead.

Which I misread at first, and therefore tried to use m/../ instead of /../, which fixed the issue.

Dada
  • 6,313
  • 7
  • 24
  • 43
  • A `for/if/elsif/else` block worked. It's not too weird. Though `for($format)` simply populating `$_` seems weird, no looping. Wouldn't just `$_ = $format` (or otherwise assigning $_, perhaps implicitly with some other function) before the `if/elsif/else` blocks work the same? – Matthew Apr 14 '19 at 20:35
  • `for` does a bit more, as within the for, `$_` is an alias for `$format`: `perl -E '$x=5;for($x){$_=2};say $x'` prints `2` whereas `perl -E '$x=5;{$_=$x;$_=2};say $x'` prints `2`. In that case though, it doesn't change anything. I tend to dislike statements like `$_ = $format`, but I'm not really sure why. – Dada Apr 14 '19 at 20:42
  • 1
    Thank you for all of the clarity. FWIW I'd bet that if all the programmer-years spent complaining were instead spent implementing a more basic `switch/case` then we'd have one :). – Matthew Apr 14 '19 at 20:46
  • @Matthew See [Switch::Plain](https://metacpan.org/pod/Switch::Plain) for one way a more basic switch/case might look. – Grinnz Apr 15 '19 at 17:25
  • @Matthew I agree the situation is unfortunate, but the reality is that everyone has different ideas for what "basic" means in a language without strict typing, and what functionality they want, and that's how we got into the smartmatch mess to begin with. – Grinnz Apr 15 '19 at 17:26
1

Another option instead of an if/elsif chain would be to loop over a hash which maps your regular expressions to the values which should be assigned to $pat:

#!/usr/local/bin/perl

my %switch = (
  '^(?:april|snow)$'    => '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$',
  '^(?:umberto|petro)$' => '^(\S+)\s+.*Think 1\s+(\S+)\s+',
);

for my $re (keys %switch) {
  if ($format =~ /$re/i) {
    $pat = $switch{$re};
    last;
  }
}

For a more general case (i.e., if you're doing more than just assigning a string to a scalar) you could use the same general technique, but use coderefs as the values of your hash, thus allowing it to execute an arbitrary sub based on the match.

This approach can cover a pretty wide range of the functionality usually associated with switch/case constructs, but note that, because the conditions are pulled from the keys of a hash, they'll be evaluated in a random order. If you have data which could match more than one condition, you'll need to take extra precautions to handle that, such as having a parallel array with the conditions in the proper order or using Tie::IxHash instead of a regular hash.

Dave Sherohman
  • 45,363
  • 14
  • 64
  • 102
  • Well, actually for each format the code must do more than just assign a regex to `$pat` , but I redacted that for posting this Question. I could take your suggestion and set `%switch` values to the `sub`s to call against a `$format` match, the Command Pattern. – Matthew Apr 16 '19 at 15:28