
Reposted from the #perl6 IRC channel, by jkramer, with permission

I'm playing with grammars and trying to parse an ini-style file but somehow Grammar.parse seems to loop forever and use 100% CPU. Any ideas what's wrong here?

grammar Format {
  token TOP {
    [
      <comment>*
      [
        <section>
        [ <line> | <comment> ]*
      ]*
    ]*
  }

  rule section {
    '[' <identifier> <subsection>? ']'
  }

  rule subsection {
    '"' <identifier> '"'
  }

  rule identifier {
    <[A..Za..z]> <[A..Za..z0..9_-]>+
  }

  rule comment {
    <[";]> .*? $$
  }

  rule line {
    <key> '=' <value>
  }

  rule key {
    <identifier>
  }

  rule value {
    .*? $$
  }
}

Format.parse('lol.conf'.IO.slurp)
jjmerelo

1 Answer

Token TOP applies the * quantifier to a subregex that can match the empty string, because both <comment> and the group that contains <section> are themselves quantified with *.

If the inner subregex matches the empty string, it can do so infinitely many times without advancing the cursor. Currently, Perl 6 has no protection against this kind of error.
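One way to see the problem without triggering the loop is to test the body of the outer group on its own. This is only a minimal sketch: the literals 'c', 's' and 'l' are stand-ins (my assumption, not part of the original grammar) for <comment>, <section> and <line>. The outer * is deliberately left off, so the match terminates.

```raku
# Stand-ins (assumptions for illustration): 'c' plays <comment>,
# 's' plays <section>, and 'l' plays <line>.
# This is the body of the outer [...]* group, without the outer star.
my $body = / ^ 'c'* [ 's' [ 'l' | 'c' ]* ]* $ /;

say so ''    ~~ $body;   # True: the body succeeds without consuming anything
say so 'csl' ~~ $body;   # True: it also matches ordinary input
```

Because the body can succeed on the empty string, wrapping it in another * lets it succeed again and again at the same position.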

It looks to me like you could simplify your code to:

token TOP {
  <comment>*
  [
    <section>
    [ <line> | <comment> ]*
  ]*
}

There is no need for the outer [...]* group, because the last <comment> also matches comments that appear before later sections.

moritz
  • Shouldn't you also use `token` instead of `rule` here? For example, the spaces in `rule comment { <[";]> .*? $$ }` could gobble up newline characters before we reach the `$$`, or am I wrong? – Håkon Hægland Apr 12 '18 at 16:31
  • If vertical whitespace is significant, as suggested by the use of `$$`, then it would be sensible to override `token ws { <!ww> \h* }` to match only horizontal whitespace. Much more on that, and two working grammars for INI files, can be found in https://smile.amazon.com/Parsing-Perl-Regexes-Grammars-Recursive-ebook/dp/B0785L6WR6 – moritz Apr 13 '18 at 09:23