I am having problems with this mini-grammar, which tries to match markdown-like header constructs.

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}
}

I would like it to match ## Easier ## as a header, but instead it takes ## as part of span:

TOP
|  header
|  |  hashes
|  |  * MATCH "##"
|  |  span
|  |  |  like-a-word
|  |  |  * MATCH "Easier"
|  |  |  like-a-word
|  |  |  * MATCH "##"
|  |  |  like-a-word
|  |  |  * FAIL
|  |  * MATCH "Easier ##"
|  * MATCH "## Easier ##"
* MATCH "## Easier ##\n"
「## Easier ##
」
 header => 「## Easier ##」
  hashes => 「##」
  span => 「Easier ##」
   like-a-word => 「Easier」
   like-a-word => 「##」

The problem is that the [\h* $0]? simply does not seem to work, with span gobbling up all available words. Any idea?

brian d foy
jjmerelo
  • Try `?` after `*`. – Rahul Jan 05 '18 at 09:03
  • If you mean in the Span definition, I did. Same result. – jjmerelo Jan 05 '18 at 09:06
  • Probably I don't understand something, but what would you expect `$0` to match, when there are no positional captures? – Eugene Barsky Jan 05 '18 at 09:39
  • That is absolutely right. It was a leftover from old attempts. – jjmerelo Jan 05 '18 at 09:57
  • Fyi: You should **always** use `token` unless you *know* you need one of the other options (`regex`, `rule`, and `method`). – raiph Jan 06 '18 at 18:05
  • Only use `regex` if you are *sure* you need *backtracking*, because backtracking will frequently make parsing run literally millions of times slower than necessary, or worse. If you switch all your `regex` declarations to `token` you'll see that your code will continue to parse correctly (at least for your trial input "## Easier ##\n") but will quite plausibly run vastly faster on large or complex inputs. – raiph Jan 06 '18 at 18:05
  • I think I need backtracking here. "## Easy Peasy ##" will fail, for instance. I can change the lowest level, `like-a-word`, to a token, though. – jjmerelo Jan 08 '18 at 06:13

3 Answers

First, as others have pointed out, <hashes> does not capture into $0; it captures into $<hashes>, so you have to write:

regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}

But that still doesn't match the way you want, because the [\h* $<hashes>]? part happily matches zero occurrences.
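
A minimal sketch of that effect outside the grammar, in case it helps to see it in isolation. The variable names are made up for illustration, and the frugal variant is shown only for contrast; it is not the fix proposed here:

```raku
# Greedy: (.+) takes the whole string, so the optional tail matches nothing.
my $greedy = "Easier ##" ~~ / ^ (.+)  [\h* '##']? $ /;

# Frugal: (.+?) takes as little as possible, so the optional tail
# gets the chance to consume the trailing hashes.
my $frugal = "Easier ##" ~~ / ^ (.+?) [\h* '##']? $ /;

say $greedy[0];   # the capture includes the trailing hashes
say $frugal[0];   # the capture stops before the trailing hashes
```

In both cases the overall match succeeds; the only difference is which part of the pattern ends up owning the closing hashes.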

The proper fix is to not let span match ## as a word:

role Like-a-word {
    regex like-a-word { <!before '#'> \S+ }   # a word may not start with '#'
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}
}

say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);

If you are loath to modify like-a-word, you can also force the exclusion of a final # from it like this:

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    # <!after '#'> rejects a span that ends in '#', so span has to backtrack
    regex header {^^ <hashes> \h+ <span> <!after '#'> [\h* $<hashes>]? $$}
}

say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);
moritz
  • It's fine, except that I might want to capture `#` in "like-a-word". Markdown can't fail, so if I have something like `### not a header ##` I would want `##` be interpreted `like-a-word`. So I guess the first one is fine, but leaving `like-a-word` just the way it was. Thanks a lot! – jjmerelo Jan 06 '18 at 08:11
  • @jjmerelo Food for thought / tests: What's supposed to happen with `## two hashes on left, three on right ###`; and `## two hashes on left, two plus two on right ## ##`; and `## two hashes on left, two plus three on right ## ###`; and `## two hashes on left, three plus two on right ### ##`; and `## ## two plus two hashes on left, two plus two on right ## ##`; and `## two hashes on left, two in middle ## and some more text ##`? – raiph Jan 06 '18 at 18:52
  • @jjmerelo you can always try to do the stricter parse first, and then use `||` to fall back to something that always matches. – moritz Jan 06 '18 at 19:26

Just change

  regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}

to

  regex header {^^ (<hashes>) \h+ <span> [\h* $0]? $$}

The parentheses turn <hashes> into a positional capture, so $0 now refers to it. Thanks to Eugene Barsky for pointing this out.
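
A small sketch of the difference, assuming a lexical token that mirrors hashes from the grammar (the variable names are only illustrative):

```raku
# Lexical stand-in for the grammar's hashes token.
my token hashes { '#' ** 1..6 }

# Without parentheses: only the named capture $<hashes> exists.
my $named = "## Easier" ~~ / <hashes> \h+ \S+ /;

# With parentheses: $0 exists and holds the hashes submatch.
my $positional = "## Easier" ~~ / (<hashes>) \h+ \S+ /;

say $named.list.elems;        # number of positional captures
say $named<hashes>;           # named capture
say $positional[0]<hashes>;   # named capture nested inside $0
```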

jjmerelo

I played with this a bit because I thought there were two interesting things you might do.

First, you can make hashes take an argument saying how many hash marks it should match. That way you can do special things based on the level if you like, and you can reuse hashes in different parts of the grammar where you require different but exact numbers of hash marks.

Next, the ~ stitcher lets you write the opening and closing parts next to each other and put the thing that goes between them afterwards. For example, to match (Foo) you could write '(' ~ ')' Foo. With that, it looks like I came up with the same thing you posted:

use Grammar::Tracer;

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* }
}

grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes ( $n = 1 ) { '#' ** {$n} }

    regex header { [(<hashes(2)>) \h*] ~ [\h* $0] <span>  }
}

my $result = Grammar::Headers.parse( "## Easier ##\n" );

say $result;
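
To look at the two features separately, here is a rough sketch with a toy grammar. Toy::Header and its parens, word, and level3 tokens are made up for illustration and are not part of the grammar above:

```raku
# Hypothetical toy grammar isolating ~ and the parameterized token.
grammar Toy::Header {
    # '(' ~ ')' <word> reads as: '(' then <word> then ')'
    token parens { '(' ~ ')' <word> }
    token word   { \w+ }

    # Matches exactly $n hash marks; $n defaults to 1.
    token hashes($n = 1) { '#' ** {$n} }
    token level3 { <hashes(3)> \h+ <word> }
}

my $parens = Toy::Header.parse('(Foo)',     :rule<parens>);
my $level3 = Toy::Header.parse('### Title', :rule<level3>);
my $level2 = Toy::Header.parse('## Title',  :rule<level3>);

say $parens<word>;      # the part between the parentheses
say $level3<hashes>;    # the call with an argument is still captured under the name hashes
say $level2.defined;    # two hash marks do not satisfy hashes(3)
```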
brian d foy
  • Thanks for the answer. I wonder how hashes will show up in the Match object. Plus, will I need to declare also header in the same way, using $n as a parameter? – jjmerelo Jan 06 '18 at 17:31
  • I think you could declare header to take a parameter and then pass that on to something below it. However, I'd probably lean towards making header1, header2, and so on. That might make the AST easier when you want to play with it. But, I hadn't thought that long on it. :) – brian d foy Jan 07 '18 at 23:50