
I am working on a Marpa::R2 grammar that groups items in a text. Each group can only contain items of a certain kind, but is not explicitly delimited. This causes problems, because x...x (where . represents an item that can be part of a group) can be grouped as x(...)x, x(..)(.)x, x(.)(..)x, x(.)(.)(.)x. In other words, the grammar is highly ambiguous.
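
To make the size of the ambiguity concrete, here is a minimal sketch in plain Perl (no Marpa involved; `groupings` is a hypothetical helper introduced only for illustration). It enumerates every way of splitting a run of n adjacent items into consecutive groups. For n items there are 2^(n-1) such splits, so the number of parses doubles with every additional item.

```perl
use strict;
use warnings;

# Enumerate every way to split a run of $n adjacent items into one or
# more consecutive groups. Hypothetical helper, not part of Marpa::R2.
sub groupings {
    my ($n) = @_;
    return ('') if $n == 0;    # nothing left to group: one (empty) way
    my @results;
    for my $first ( 1 .. $n ) {
        # The first group takes $first items; recurse on the remainder.
        push @results, '(' . ( '.' x $first ) . ')' . $_
            for groupings( $n - $first );
    }
    return @results;
}

# Print the groupings of three items, wrapped in the surrounding x ... x.
print "x${_}x\n" for groupings(3);
```

Only the split that puts all items into a single group, `x(...)x`, is the "greedy" one asked about below.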

How can I remove this ambiguity if I only want the x(...)x parse, i.e. how can I force a + quantifier to behave greedily (as it does in Perl regexes)?

In the grammar below, I tried adding rank adverbs to the sequence rules in order to prioritize Group over Sequence, but that doesn't seem to work.

Below is a test case that exercises this behaviour.

use strict;
use warnings;

use Marpa::R2;
use Test::More;

my $grammar_source = <<'END_GRAMMAR';
inaccessible is fatal by default
:discard ~ space
:start ::= Sequence

Sequence
    ::= SequenceItem+  action => ::array
SequenceItem
    ::= WORD    action => ::first
    |   Group   action => ::first
Group
    ::= GroupItem+  action => [name, values]
GroupItem
    ::= ('[') Sequence (']')  action => ::first

WORD    ~ [a-z]+
space   ~ [\s]+
END_GRAMMAR

my $input = "foo [a] [b] bar";

diag "perl $^V";
diag "Marpa::R2 " . Marpa::R2->VERSION;

my $grammar = Marpa::R2::Scanless::G->new({ source => \$grammar_source });
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

$recce->read(\$input);

my $parse_count = 0;
while (my $value = $recce->value) {
    is_deeply $$value, ['foo', [Group => ['a'], ['b']], 'bar'], 'expected structure'
        or diag explain $$value;
    $parse_count++;
}
is $parse_count, 1, 'expected number of parses';

done_testing;

Output of the test case (FAIL):

# perl v5.18.2
# Marpa::R2 2.09
ok 1 - expected structure
not ok 2 - expected structure
#   Failed test 'expected structure'
#   at - line 38.
#     Structures begin differing at:
#          $got->[1][2] = Does not exist
#     $expected->[1][2] = ARRAY(0x981bd68)
# [
#   'foo',
#   [
#     'Group',
#     [
#       'a'
#     ]
#   ],
#   [
#     ${\$VAR1->[1][0]},
#     [
#       'b'
#     ]
#   ],
#   'bar'
# ]
not ok 3 - expected number of parses
#   Failed test 'expected number of parses'
#   at - line 41.
#          got: '2'
#     expected: '1'
1..3
# Looks like you failed 2 tests of 3.
Miller
amon

2 Answers


Sequence rules are designed for non-tricky cases. Sequence rules can always be rewritten as BNF rules when the going gets tricky, and that is what I suggest here. The following makes your test work:

use strict;
use warnings;

use Marpa::R2;
use Test::More;

my $grammar_source = <<'END_GRAMMAR';
inaccessible is fatal by default
:discard ~ space

# Three cases
# 1.) Just one group.
# 2.) Group followed by alternating words and groups.
# 3.) Alternating words and groups, starting with words
Sequence ::= Group action => ::first
Sequence ::= Group Subsequence action => [values]
Sequence ::= Subsequence action => ::first

Subsequence ::= Words action => ::first

# "action => [values]" makes the test work unchanged.
# The action for the next rule probably should be
# action => [name, values] in order to handle the general case.
Subsequence ::= Subsequence Group Words action => [values]

Words ::= WORD+ action => ::first
Group
    ::= GroupItem+  action => [name, values]
GroupItem
    ::= ('[') Sequence (']')  action => [value]

WORD    ~ [a-z]+
space   ~ [\s]+
END_GRAMMAR

my $input = "foo [a] [b] bar";

diag "perl $^V";
diag "Marpa::R2 " . Marpa::R2->VERSION;

my $grammar = Marpa::R2::Scanless::G->new( { source  => \$grammar_source } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

$recce->read( \$input );

my $parse_count = 0;
while ( my $value = $recce->value ) {
    is_deeply $$value, [ 'foo', [ Group => ['a'], ['b'] ], 'bar' ],
        'expected structure'
        or diag explain $$value;
    $parse_count++;
} ## end while ( my $value = $recce->value )
is $parse_count, 1, 'expected number of parses';

done_testing;
Jeffrey Kegler
  • Thank you. I guess I'll then write recursive rules, and flatten the parse tree via custom actions. I just wanted first to be sure that there wasn't a more efficient way I had simply overlooked. FYI, I stumbled upon this problem while trying to parse lists in a Markdown-like language, i.e. doing indentation-sensitive parsing. It's not something Marpa excels at, but it's surprisingly easy with a few events (though there may be a bug in the event system regarding zero-L0-length externally read lexemes which I have to investigate a bit) – amon Aug 31 '14 at 18:36

Unambiguous grammar:

Sequence           : WORD+ SequenceAfterWords
                   | Group SequenceAfterGroup

SequenceAfterWords : Group SequenceAfterGroup
                   |

SequenceAfterGroup : WORD+ SequenceAfterWords
                   |

Jeffrey Kegler says that Marpa handles left recursion (leading with the recursion) slightly more efficiently than right recursion. The same approach can be applied back to front to produce a left-recursive version:

Sequence            : SequenceBeforeWords WORD+
                    | SequenceBeforeGroup Group

SequenceBeforeWords : SequenceBeforeGroup Group
                    |

SequenceBeforeGroup : SequenceBeforeWords WORD+
                    |

In both cases,

Group     : GroupItem+

GroupItem : '[' Sequence ']'
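
As a rough sketch of how the first (right-recursive) grammar might be written in Marpa::R2's SLIF, the snippet below reuses the lexemes and input from the question and simply counts the parses. Several things here are assumptions rather than part of the answer: `WORD+` is factored into a helper rule named `Words`, because a SLIF quantified rule has to stand alone; each empty alternative is written as a separate empty rule; and the `:default` action of `[name, values]` is used only so that every rule returns something printable. The resulting tree is nested, not flattened, so it would still need custom actions to match the structure the question's test expects.

```perl
use strict;
use warnings;
use Marpa::R2;

# SLIF transcription (an assumption, see above) of:
#   Sequence           : WORD+ SequenceAfterWords | Group SequenceAfterGroup
#   SequenceAfterWords : Group SequenceAfterGroup | <empty>
#   SequenceAfterGroup : WORD+ SequenceAfterWords | <empty>
my $grammar_source = <<'END_GRAMMAR';
:default ::= action => [name, values]
:start ::= Sequence
:discard ~ space

Sequence           ::= Words SequenceAfterWords
Sequence           ::= Group SequenceAfterGroup
SequenceAfterWords ::= Group SequenceAfterGroup
SequenceAfterWords ::=
SequenceAfterGroup ::= Words SequenceAfterWords
SequenceAfterGroup ::=

Words     ::= WORD+
Group     ::= GroupItem+
GroupItem ::= ('[') Sequence (']')

WORD  ~ [a-z]+
space ~ [\s]+
END_GRAMMAR

my $input = "foo [a] [b] bar";

my $grammar = Marpa::R2::Scanless::G->new( { source  => \$grammar_source } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
$recce->read( \$input );

# Iterate over all parses; an unambiguous grammar should yield exactly one.
my $parse_count = 0;
$parse_count++ while $recce->value;
print "number of parses: $parse_count\n";
```

Because a `Group` can only be followed by words or by the end of the sequence, two adjacent `GroupItem`s can never end up in separate groups, which is what forces the "greedy" reading.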
ikegami
  • A very elegant approach. Mine is left recursive because left recursion is slightly better for the Marpa internals than right recursion -- both are linear if unambiguous, but right recursion requires a bit more bookkeeping. – Jeffrey Kegler Sep 01 '14 at 04:51
  • @Jeffrey Kegler, what about the fact that you have productions of the same rule that share a common start? Does Marpa handle those efficiently? – ikegami Sep 01 '14 at 04:53
  • You mean rules with the same LHS. Yes, Marpa has no problem with these. – Jeffrey Kegler Sep 01 '14 at 04:54
  • I note the difference in the internals between left & right recursion is not a big deal (except if the right recursion is ambiguous) -- so the more elegant solution (yours) may be the way to go here. – Jeffrey Kegler Sep 01 '14 at 04:57
  • ikegami -- I'm trying to use the at-sign in my replies, but for some reason stackoverflow is eating them. – Jeffrey Kegler Sep 01 '14 at 05:01
  • @Jeffrey Kegler, If you reply to a Q or A, the writer of the Q or A gets notified automatically. – ikegami Sep 01 '14 at 05:03
  • @Jeffrey Kegler, I meant rules of the form `A ::= B C | B D`. – ikegami Sep 01 '14 at 05:04
  • I know some interfaces require you to group all the alternatives for a LHS into one rule, but Marpa (following Earley's) does not. Doing alternatives as separate rules is common in older papers & textbooks. – Jeffrey Kegler Sep 01 '14 at 05:12
  • @Jeffrey Kegler, Argh, no. Is the common start `B` in both productions of `A` in `A ::= B C | B D` handled efficiently in Marpa? – ikegami Sep 01 '14 at 05:24
  • ikegami -- sorry to have missed the point 1st time. Yes, `A ::= B C | B D` is handled very efficiently. To see how it works, look up Earley's -- Marpa's efficient handling of this construct comes straight out of Earley's old algorithm. – Jeffrey Kegler Sep 01 '14 at 09:30
  • @ikegami Yes, Earley parsers such as Marpa handle such a common start efficiently. When `A` is predicted, this predicts the productions `A → · B C` and `A → · B D` (`·` is the current location). So both productions predict `B`. After `B` has been completed, we are at `A → B · C` and `A → B · D`, which predicts `C` and `D`. If neither is found, `A` is removed from the current productions. If either is found, that production and `A` are completed. If both are found, both productions are completed and we have an ambiguous parse. This prediction-based table parsing implies no backtracking. – amon Sep 01 '14 at 09:32
  • However, that rule `A` always adds two productions to the set of current productions. Keeping this set small is beneficial for performance, as it will have to be iterated after every token to generate new predictions. A variant such as `A ::= B <rest>; <rest> ::= C | D` will keep the set smaller if `B` isn't found, but adds the overhead of another rule. I'd rather not micro-optimize my grammars on this level, as optimizing the Perl callbacks of a Marpa parser has a much higher ROI. – amon Sep 01 '14 at 09:33
  • @amon -- Such micro-optimizations are, I believe, pointless or even counter-productive. To get very detailed, internally Marpa *does* rewrite rules for efficiency. This is *not* one of its rewrites, because there's not sufficient gain. Sometimes overemphasis on reducing the number of productions in the Earley tables forces higher costs elsewhere. – Jeffrey Kegler Sep 01 '14 at 09:45
  • Made a small simplification. – ikegami Sep 01 '14 at 16:52
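
The prediction, scanning and completion steps that amon describes for `A ::= B C | B D` can be illustrated with a toy Earley recognizer in plain Perl. This is only a sketch for following the comments: it is not Marpa's implementation, it does not handle nullable rules, and the names `%rules`, `earley_sets`, `accepts` and `show` are made up for the example. `B`, `C` and `D` are assumed to match the single tokens `b`, `c` and `d`.

```perl
use strict;
use warnings;

# Toy grammar: A ::= B C | B D. Symbols without an entry in %rules
# (here b, c, d) are treated as terminals.
my %rules = (
    A => [ [qw(B C)], [qw(B D)] ],
    B => [ ['b'] ],
    C => [ ['c'] ],
    D => [ ['d'] ],
);

# Build the Earley sets for @tokens. An item is [lhs, alternative, dot, origin].
# Limitation of this sketch: nullable (empty) rules are not handled.
sub earley_sets {
    my ( $start, @tokens ) = @_;
    my @sets = map { [] } 0 .. scalar @tokens;
    my @seen = map { +{} } 0 .. scalar @tokens;
    my $add  = sub {
        my ( $i, @item ) = @_;
        push @{ $sets[$i] }, [@item] unless $seen[$i]{"@item"}++;
    };
    $add->( 0, $start, $_, 0, 0 ) for 0 .. $#{ $rules{$start} };
    for my $i ( 0 .. scalar @tokens ) {
        # The set may grow while we iterate, so use an index loop.
        for ( my $k = 0; $k < @{ $sets[$i] }; $k++ ) {
            my ( $lhs, $alt, $dot, $origin ) = @{ $sets[$i][$k] };
            my $next = $rules{$lhs}[$alt][$dot];
            if ( !defined $next ) {
                # Completion: advance every item that was waiting for $lhs.
                for my $parent ( @{ $sets[$origin] } ) {
                    my ( $plhs, $palt, $pdot, $porigin ) = @{$parent};
                    my $want = $rules{$plhs}[$palt][$pdot];
                    $add->( $i, $plhs, $palt, $pdot + 1, $porigin )
                        if defined $want && $want eq $lhs;
                }
            }
            elsif ( $rules{$next} ) {
                # Prediction: add every alternative of the expected symbol.
                $add->( $i, $next, $_, 0, $i ) for 0 .. $#{ $rules{$next} };
            }
            elsif ( $i < @tokens && $tokens[$i] eq $next ) {
                # Scanning: the terminal matches, move the dot past it.
                $add->( $i + 1, $lhs, $alt, $dot + 1, $origin );
            }
        }
    }
    return @sets;
}

# True if some alternative of $start spans the whole input.
sub accepts {
    my ( $start, @tokens ) = @_;
    my @sets = earley_sets( $start, @tokens );
    return scalar grep {
        $_->[0] eq $start && $_->[3] == 0
            && $_->[2] == @{ $rules{$start}[ $_->[1] ] }
    } @{ $sets[-1] };
}

# Render an item with "." marking the current location.
sub show {
    my ( $lhs, $alt, $dot, $origin ) = @{ $_[0] };
    my @rhs = @{ $rules{$lhs}[$alt] };
    splice @rhs, $dot, 0, '.';
    return "$lhs -> @rhs (from $origin)";
}

my @sets = earley_sets( 'A', qw(b c) );
for my $i ( 0 .. $#sets ) {
    print "Set $i:\n";
    print '  ', show($_), "\n" for @{ $sets[$i] };
}
```

After the token `b` has been read, the second set holds both `A -> B . C` and `A -> B . D`, so `B` is recognized once and shared by both alternatives without any backtracking.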