1

I'm attempting to write a grammar to parse numbers in English sentences, and I can successfully parse up to 999. Once I add in the logic to support the thousands place, I get a reduce parsing conflict, and I'm having a hard time understanding what's causing it.

I've attached a portion of the parser.out file generated by lemon, and I'm hoping someone can shed some light on the issue. I've also included a large portion of the grammar, where everything below the line works on it's own, but once I add in the logic for thousands above the line I begin running into issues.

My thought is that I'm running into an issue similar to the "dangling else" but with my separator. However, that normally manifests as a shift-reduce error, whereas it looks like I have just a reduce error. Lemon documentation is a bit sparse, and I'm not exactly sure how to read the parser.out file's contents. For example, in the line HYPHEN reduce 15 ** Parsing conflict **, what does the 15 even refer to?

Any help would be greatly appreciated!


Portion of my grammar file:

final_number(A) ::= one_to_999999(B).
final_number(A) ::= ZERO.

one_to_999999(A) ::= thousands(B) separator one_to_999(C).
one_to_999999(A) ::= thousands(B).
one_to_999999(A) ::= one_to_999(B).

thousands(A) ::= one_to_999(B) separator THOUSAND.
thousands(A) ::= THOUSAND.

/* -------------------------------------- */

one_to_999(A) ::= hundreds(B) separator one_to_99(C).
one_to_999(A) ::= hundreds(B).
one_to_999(A) ::= one_to_99(B).

one_to_99(A) ::= tens(B) separator one_to_9(C).
one_to_99(A) ::= tens(B).
one_to_99(A) ::= ten_to_19(B).
one_to_99(A) ::= one_to_9(B).

hundreds(A) ::= one_to_9(B) separator HUNDRED.
hundreds(A) ::= HUNDRED.

separator ::= WHITESPACE.
separator ::= HYPHEN.
separator ::= .

Portion of parser.out with an error:

State 5:
          one_to_99 ::= tens * separator one_to_9
     (15) one_to_99 ::= tens *
          separator ::= * WHITESPACE
          separator ::= * HYPHEN
     (65) separator ::= *

                             $ reduce       15     one_to_99 ::= tens
                      THOUSAND reduce       15     one_to_99 ::= tens
                    WHITESPACE shift-reduce 63     separator ::= WHITESPACE
                    WHITESPACE reduce       15      ** Parsing conflict **
                        HYPHEN shift-reduce 64     separator ::= HYPHEN
                        HYPHEN reduce       15      ** Parsing conflict **
                     separator shift        4      
                     {default} reduce       65     separator ::=
JesseBuesking
  • 6,496
  • 4
  • 44
  • 89

1 Answers1

1

There isn't actually quite enough information here to diagnose the complete problem, but I think I can fill in the blanks.

What's indicated is that the problem is a state where the parser has recognized tens (that would be "twenty", "thirty", ..., "ninety", right?) and it now needs a separator (which might be optional). If the lookahead token is an actual separator, it has to decide whether to reduce tens immediately to one_to_99 (as a prelude to completing one_to_999 without a trailing digit) or shift the WHITESPACE or HYPHEN character in order to extend tens with a separator and a single digit (one_to_9).

The parser really cannot make that decision just looking at the separator token. It needs to know what follows (which might be, for example, THOUSAND or ONE, amongst other possibilities).

This doesn't happen before you add thousands to the grammar, because without the possibility of THOUSAND, if there is no single digit at the end of the number there is no separator following the tens token either. So if there is an explicit separator, then there must be a digit and thus the shift is required. Once you add the THOUSAND option, the existence of the separator token is no longer an adequate guide.

Trying to explicitly match whitespace in a parser is similar to what is commonly known as "scannerless parsing", although that's not strictly the case here since you probably do actually have a scanner. However, the scanner is not doing its job properly; it is failing to drop tokens which have no syntactic value. While there are people who like scannerless parsing, it is generally agreed that it tends to increase lookahead requirements. [Note 1] Since you cannot increase the lookahead for a lemon parser (nor for many other yacc-based parser generators), scannerless parsing is problematic with such tools.

In this case, it is hard to see what you might gain by forcing the parser to deal with separators, and it is apparent what you have lost (LALR(1) parsability), so I'd suggest that you just drop whitespace and hyphens on the floor in your scanner, and remove them from your parser. You might argue that doing that will allow incorrect sentences such as three hundred forty---two. That's true, but your current grammar allows three hundred-forty two (which is not correct in any style guide I've ever seen), and may forbid forty - two, depending on what pattern your scanner uses to recognize hyphens.

If you want to be "hyphen-correct", by all means return hyphens (but not whitespace) from your scanner, and then accept them only where useful:

one_to_99 ::= tens
            | tens one_to_9
            | tens HYPHEN one_to_9
            ;

which will not produce any shift/reduce conflicts.

Notes

  1. I'm not one of the people who likes scannerless parsing, so I won't even attempt to explain why it is thought to be a good idea.
Community
  • 1
  • 1
rici
  • 234,347
  • 28
  • 237
  • 341
  • I'm at work now, but I'll read and respond to your comment tonight. I did notice you mentioned at first that there's not quite enough information to go on, so I apologize in advance and I'll attempt to supply anything you might find helpful. Thank you! – JesseBuesking Nov 07 '16 at 22:47
  • Unfortunately the numbers may appear mid-sentence, e.g. `I am thirty three years old`, which would output as `I am 33 years old`. I was attempting to preserve whitespace so that I could reconstruct the sentence. I thought of a workaround for this which is to track starting and ending token locations, that way I can know that `33` appears between locations 5 and 17, and then I can rebuild the sentence inserting the numbers. Does that sound like a good approach, or does knowing what I'm trying to do lead you to a better solution? – JesseBuesking Nov 08 '16 at 04:29
  • Also, do the asterisks in the parser.out file represent the current location of the parser? I'm still confused about what the numbers are attempting to represent. – JesseBuesking Nov 08 '16 at 04:31
  • @jesse, yes the asterisks indicate the "point" in each production corresponding to the current input location. The numbers are either the next state (shift/goto) or the production number (reduce). – rici Nov 08 '16 at 06:11