Negated Named Regex, or Character Class Interpolation in Raku

Question

I'm trying to parse a quoted string. Something like this:

say '"in quotes"' ~~ / '"' <-[ " ]> * '"'/;

(From https://docs.raku.org/language/regexes "Enumerated character classes and ranges") But... I want more than one type of quote. Something like this made-up syntax that doesn't work:

  token attribute_value { <quote> ($<-quote>) $<quote> };
  token quote           { <["']> };

I found this discussion which is another approach, but it didn't seem to go anywhere: https://github.com/Raku/problem-solving/issues/97. Is there any way of doing this kind of thing? Thanks!

Update 1

I was not able to get @user0721090601's "multi token" solution to work. My first attempt yielded:

$ ./multi-token.raku 
No such method 'quoted_string' for invocant of type 'QuotedString'
  in block <unit> at ./multi-token.raku line 16

After doing some research I added proto token quoted_string {*}:

#!/usr/bin/env raku

use Grammar::Tracer;

grammar QuotedString {
  proto token quoted_string {*}
  multi token quoted_string:sym<'> { <sym> ~ <sym> <-[']> }
  multi token quoted_string:sym<"> { <sym> ~ <sym> <-["]> }
  token quote         { <["']> }
}

my $string = '"foo"';

my $quoted-string = QuotedString.parse($string, :rule<quoted_string>);
say $quoted-string;

$ ./multi-token.raku 
quoted_string
* FAIL
(Any)

I'm still learning Raku, so I could be doing something wrong.

Update 2

D'oh! Thanks to @raiph for pointing this out. I forgot to put a quantifier on <-[']> and <-["]>. That's what I get for copy/pasting without thinking! Works fine when you do it right:

#!/usr/bin/env raku

use Grammar::Tracer;

grammar QuotedString {
  proto token quoted_string (|) {*}
  multi token quoted_string:sym<'> { <sym> ~ <sym> <-[']>+ }
  multi token quoted_string:sym<"> { <sym> ~ <sym> <-["]>+ }
  token quote         { <["']> }
}

my $string = '"foo"';

my $quoted-string = QuotedString.parse($string, :rule<quoted_string>);
say $quoted-string;

Update 3

Just to put a bow on this...

#!/usr/bin/env raku

grammar NegativeLookahead {
  token quoted_string { <quote> $<string>=([<!quote> .]+) $<quote> }
  token quote         { <["']> }
}

grammar MultiToken {
  proto token quoted_string (|) {*}
  multi token quoted_string:sym<'> { <sym> ~ <sym> $<string>=(<-[']>+) }
  multi token quoted_string:sym<"> { <sym> ~ <sym> $<string>=(<-["]>+) }
}

use Bench;

my $string = "'foo'";

my $bench = Bench.new;
$bench.cmpthese(10000, {
  negative-lookahead =>
    sub { NegativeLookahead.parse($string, :rule<quoted_string>); },
  multi-token        =>
    sub { MultiToken.parse($string, :rule<quoted_string>); },
});

$ ./bench.raku
Benchmark: 
Timing 10000 iterations of multi-token, negative-lookahead...
multi-token: 0.779 wallclock secs (0.759 usr 0.033 sys 0.792 cpu) @ 12838.058/s (n=10000)
negative-lookahead: 0.912 wallclock secs (0.861 usr 0.048 sys 0.909 cpu) @ 10967.522/s (n=10000)
O--------------------O---------O-------------O--------------------O
|                    | Rate    | multi-token | negative-lookahead |
O====================O=========O=============O====================O
| multi-token        | 12838/s | --          | -20%               |
| negative-lookahead | 10968/s | 25%         | --                 |
O--------------------O---------O-------------O--------------------O

I'll be going with the "multi token" solution. Thanks everyone!

Actually in your update, you'll find you don't even need the `<quote>` token now. — user0721090601, Dec 06 '20 at 20:06
Fwiw, using `+` as the quantifier means the string must contain at least one character, so `''` or `""` won't parse. — raiph, Dec 06 '20 at 20:15
Your final update shows a classic trade-off. The negative lookahead is definitely faster, but it's not as readable (it doesn't jump out immediately what's going on, although with about 15-20 seconds it's clear). The multi jumps out immediately what's going on (esp. if you know what `~` does). Great job and welcome to Raku! — user0721090601, Dec 06 '20 at 22:18
You can eliminate the capturing associated with one or both of the `<sym>` assertions in the `multi-token` grammar by writing `<.sym>` instead of `<sym>`. — raiph, Dec 07 '20 at 00:06
@user0721090601 "The negative lookahead is definitely faster..." Looks slower to me and .@JustThisGuy. ;) "but it's not as readable (it doesn't jump out immediately what's going on, although with about 15-20 seconds it's clear)." Perhaps 30 seconds would be better? The `NegativeLookahead` grammar will fail to parse a double quoted string that contains a single quote, and vice-versa. Methinks .@JustThisGuy has the right end result, and we might all best ignore how you and they got there. :) — raiph, Dec 08 '20 at 15:58

user0721090601 · Accepted Answer · 2020-12-06T20:07:04.917

There are a few different approaches that you can take — which one is best will probably depend on the rest of the structure you're employing.

But first, an observation on your current solution and why extending it to other quote characters won't work this way. Consider the string 'value". Should that parse? The structure you laid out would actually match it! That's because each <quote> token will match either a single or a double quote.

Dealing with the inner

The simplest solution is to make your inner part a non-greedy wildcard:

<quote> (.*?) <quote>

This will stop the match as soon as you reach a quote again. Also note the alternative syntax using a tilde that lets the two terminal bits be closer together:

<quote> ~ <quote> (.*?)

Your initial attempt wanted to use a sort of non-match. This does exist in the form of an assertion, <!quote>, which will fail if a <quote> is found (which needn't be just a character, but anything arbitrarily complex). It doesn't consume, though, so you need to provide that separately. For instance:

[<!quote> .]*

This checks that the next character is NOT a quote, and then consumes it.
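Both ways of handling the inner part can be tried side by side. The following is only a sketch: the names `quote`, `AnyQuote` and `string` are illustrative. It assumes the frugal `.*?` is used in a plain backtracking regex; as far as I understand it, a `token` ratchets, so a frugal quantifier there stays at its shortest match and is never extended. It also shows two limitations discussed on this page: matching "any quote" at both ends accepts mismatched quotes, and `<!quote>` rejects every quote character inside the string.

```raku
# (1) Frugal wildcard, in a backtracking regex (not a token).
my regex quote { <["']> }

for Q{"in quotes"}, Q{'value"} -> $input {
    if $input ~~ / ^ <quote> (.*?) <quote> $ / {
        say "$input matched, inner text: $0";   # the mismatched 'value" matches too
    }
    else {
        say "$input did not match";
    }
}

# (2) Negative lookahead, inside a grammar.
grammar AnyQuote {
    token TOP   { <quote> $<string>=[ [ <!quote> . ]* ] <quote> }
    token quote { <["']> }
}

for Q{"in quotes"}, Q{"it's"} -> $input {
    my $match = AnyQuote.parse($input);
    # "it's" is rejected: the inner ' stops the loop and is taken as the closer
    say $input, ' => ', $match ?? ~$match<string> !! 'no match';
}
```

Neither variant ties the closing quote to the opening one, which is the subject of the next part of the answer.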

Lastly, you could use either of the two approaches and use a <content> token that handles the inside. This is actually a great approach if you intend to later do more complex things (e.g. escape characters).

Avoiding a mismatch

As I noted, your solution would parse mismatched quotes. So we need to have a way to ensure that the quote we are (not) matching is the same as the start one. One way to do this is using a multi token:

proto token attribute_value (|) { * }
multi token attribute_value:sym<'> { <sym> ~ <sym> <-[']>* }
multi token attribute_value:sym<"> { <sym> ~ <sym> <-["]>* }

(Using the actual token <sym> is not required; you could write it as { \' <-[']>* \' } if you wanted.)

Another way you could do this is by passing a parameter (either literally, or via dynamic variables). For example, you could write attribute_value as

token attribute_value {
    $<start-quote>=<quote>      # your actual start quote
    :my $*end-quote;            # define the variable in the regex scope
    { $*end-quote = ... }       # determine the requisite end quote (e.g. ” for “)
    <attribute_value_contents>  # handle actual content
    $*end-quote                 # fancy end quote
}

token attribute_value_contents {
    # We have access to $*end-quote here, so we can use
    # any ONE of the techniques we've described before
    # (they are alternatives, not a sequence)
    # (a) using a look ahead
    [<!before $*end-quote> .]*
    # (b) being lazy (the easier)
    .*?
    # (c) using another token (described below)
    <attr_value_content_char>+
}

I mention the last one because you can even further delegate if you ultimately decide to allow for escape characters. For example, you could then do

proto token attr_value_content_char (|) { * }
multi token attr_value_content_char:sym<escaped> { \\ $*end-quote }
multi token attr_value_content_char:sym<literal> { . <?{ $/ ne $*end-quote }> }

But if that's overkill for what you're doing, ah well :-)
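To make the dynamic-variable idea concrete, here is a minimal runnable sketch. Several things in it are my assumptions rather than part of the code above: the `%closer` lookup table, the token names `open-quote`, `contents` and `content-char`, and the grammar name. It also swaps the proto/multi tokens for a sequential alternation (`||`), so the order in which the escaped and literal cases are tried is explicit, and it interpolates the variable as `"$*end-quote"`.

```raku
# Hypothetical mapping from an opening quote to the closing quote it requires.
my %closer = '"' => '"', "'" => "'", '“' => '”';

grammar AttributeValue {
    token TOP {
        :my $*end-quote;                               # visible to the tokens called below
        <open-quote>
        { $*end-quote = %closer{ ~$<open-quote> } }    # choose the matching closer
        <contents>
        "$*end-quote"
    }
    token open-quote   { <["'“]> }
    token contents     { <content-char>* }
    token content-char {
        || \\ "$*end-quote"              # an escaped closing quote
        || <!before "$*end-quote"> .     # any character other than the closer
    }
}

for Q{"it's"}, Q{'say \'hi\''}, Q{“curly”}, Q{'value"} -> $input {
    my $match = AttributeValue.parse($input);
    say $input, ' => ', $match ?? ~$match<contents> !! 'no match';
}
```

Because the closer is derived from the opener, a mismatched input such as `'value"` should be rejected, while a string may freely contain the other kind of quote.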

Anyways, there are probably other ways that didn't jump to my mind that others can think of, but that should hopefully put you on the right path. (also some of this code is untested, so there may be slight errors, apologies for that)

They wrote "I want more than one type of quote. Something like this made up syntax that doesn't work: ... `<quote> ($<-quote>) $<quote>`". Note they've used `<quote>` for the opener but `$<...>` for the negation and closer. This latter clearly conveys to me the notion that the closer is the *capture* of the opener, and thus the same as the opener. So I think "The structure you laid out actually would match" your `'value"` example is "unfair". :) — raiph, Dec 06 '20 at 10:23
`<?{ $/ ne $*end-quote }>` is better written as `<!after "$*end-quote">` — Brad Gilbert, Dec 06 '20 at 16:14
@BradGilbert Quite right! I knew someone would have a cleaner way to do that, I've updated my answer — user0721090601, Dec 06 '20 at 16:48
@raiph this is what I get for answering late at night lol, indeed I missed the `$` — user0721090601, Dec 06 '20 at 16:48
First of all, thanks for the detailed answer, and thank you @BradGilbert for the negative lookahead suggestion! The negative lookahead solution is exactly what I was looking for. That being said, I was not able to get the "multi token" solution to work. Please see Update for details. — JustThisGuy, Dec 06 '20 at 17:37
@raiph Indeed. Not sure if my sleep deprived brain forgot it or I was trying to keep the example code simpler. Maybe both. I've updated accordingly — user0721090601, Dec 06 '20 at 19:59

score 4 · Answer 2 · answered Dec 06 '20 at 16:19

Assuming that you just want to match the same quote character again:

token attribute-value { <string> }

token string {
  # match <quote> and expect to end with "$<quote>"
  <quote> ~ "$<quote>"

  [
    # update match structure in $/ otherwise "$<quote>" won't work
    {}

    <!before "$<quote>"> # next character isn't the same as $<quote>

    .    # any character

  ]*     # any number of times
}

token quote { <["']> }

For anything more complex use something like the $*end-quote dynamic variable from the earlier answer.

Thanks for the detailed explanation of the negative lookahead solution! — JustThisGuy, Dec 06 '20 at 17:54