What is this? Looking for the correct terminology for what is going on here

Question

Consider the following grammar, which has an obvious flaw as far as parser generators are concerned:

"Start Symbol" = <Foo>
"Case Sensitive" = True
"Character Mapping" = 'Unicode'

{A} = {Digit}
{B} = [abcdefABCDEF]
{C} = {A} + {B}

Integer = {A}+
HexNumber = {C}+


<ContextA> ::= '[' HexNumber ']'
<ContextB> ::= '{' Integer '}'                      
<Number> ::= <ContextA> | <ContextB>
<Foo> ::= <Number> <Foo>
       | <>

The reason this grammar is flawed is that the scanner cannot distinguish between the terminals Integer and HexNumber. (Is 1234 an integer or a hex number?)

In the productions written in this example this hardly matters, but there might be grammars where the context of the productions makes clear whether an integer or a hex number is expected, and the scanner would still refuse to cooperate.

So the scanner would need to know the parser state in order to make the right decision between the hex and the integer token.

Now the question about terminology. What does this make this... errm... grammar? Or lexer? A context-sensitive lexer? Or would one say this is a context-sensitive grammar, even though it is clearly a scanner problem? Is there other terminology used to describe such phenomena?

score 2 · Answer 1 · answered May 16 '15 at 18:04

Context sensitive means something quite different.

If you were to use a more formal notation, you'd see that your original grammar was ambiguous, as Ignacio Vazquez-Abrams said, and your edited grammar could be handled fine by an LR(1) (or even LL(1)) parser generator. Here is an unproblematic bison grammar:

%start number
%%
digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
hex   : digit
      | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' 
      | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
decnum: digit | decnum digit
hexnum: hex   | hexnum hex
number: '[' decnum ']'
      | '{' hexnum '}'

It's not usual to use bison to create a scanner, of course, but it is certainly possible.

I think the problem you are contemplating is this: if we build a scanner using flex, it would look like this:

[[:digit:]]+  { yylval.string = strdup(yytext); return DECNUM; }
[[:xdigit:]]+ { yylval.string = strdup(yytext); return HEXNUM; }

Flex cannot return an ambiguous token, so in the case where the (next part of the) input is 1234, flex needs to return either DECNUM or HEXNUM. The first longest ("maximal munch") rule means that whichever pattern comes first in the flex definition will win in the case of a token which could be matched either way. That implies that the DECNUM pattern needs to come first, because otherwise it would be impossible for it to trigger (and flex will provide a warning in that case).
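The tie-breaking behaviour described above can be sketched in plain C. This is only an illustration of the longest-match, first-rule-wins policy, not flex's actual implementation; next_token, match_dec and match_hex are hypothetical names introduced here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token kinds; the names mirror the two flex rules. */
enum token { NONE, DECNUM, HEXNUM };

/* Length of the longest prefix of s consisting of decimal digits. */
static size_t match_dec(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length of the longest prefix of s consisting of hexadecimal digits. */
static size_t match_hex(const char *s) {
    size_t n = 0;
    while (isxdigit((unsigned char)s[n])) n++;
    return n;
}

/* Hypothetical helper: choose the rule with the longest match; on a tie,
 * the rule listed first (DECNUM) wins. The match length goes to *len. */
enum token next_token(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    size_t dec = match_dec(s);
    size_t hex = match_hex(s);
    if (hex > dec) { *len = hex; return HEXNUM; }  /* strictly longer match */
    if (dec > 0)   { *len = dec; return DECNUM; }  /* tie: first rule wins */
    *len = 0;
    return NONE;
}
```

With this sketch, the input 1234] yields DECNUM with length 4, while 12ab] yields HEXNUM with length 4, because only the hex pattern matches all four characters.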

But now there is a minor problem for the grammar, because when the grammar is expecting a HEXNUM, it needs to be prepared to find a DECNUM. That's not a problem, provided the grammar is unambiguous. We only need to create a couple of non-terminals:

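/* Each alternative needs its own action; the default $$ = $1 would be a type clash. */
decnum: DECNUM           { $$ = strtol($1, NULL, 10); free($1); }
hexnum: DECNUM           { $$ = strtol($1, NULL, 16); free($1); }
      | HEXNUM           { $$ = strtol($1, NULL, 16); free($1); }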

That will not create an ambiguity or even a shift/reduce conflict which doesn't already exist in the grammar.

If you want to try this, you'll need to declare some types in your bison prolog:

%union {
   char* string;
   long  integer;
}
%token <string> HEXNUM DECNUM
%type <integer> hexnum decnum
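
To see why it is harmless for the scanner to report 1234 as DECNUM even in a hex context, here is a small C sketch of what the two actions compute. lexeme_value is a hypothetical helper, not part of the grammar; it only shows that the base is chosen by the rule that consumes the token, not by the scanner.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper mirroring the two non-terminals: the scanner hands
 * over the lexeme text, and the consuming rule decides the base. */
long lexeme_value(const char *lexeme, int expect_hex) {
    assert(lexeme != NULL);
    return strtol(lexeme, NULL, expect_hex ? 16 : 10);
}
```

Under these assumptions, lexeme_value("1234", 0) gives 1234 and lexeme_value("1234", 1) gives 4660 (0x1234), so the same DECNUM lexeme is interpreted correctly in either context.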

The less formal notation I used just so happens to be the grammar GOLD parser uses. The benefit of GOLD is that you can write a grammar and directly test it without any coding. Your fix is interesting. Yet, this still keeps the question open, what the terminology for this would be: Scanner-hard? :) On a side note, I bumped into this problem while writing grammar for a PGN parser. Side note 2. I was slightly hoping someone would mention a parser generator system which uses a different approach to scanner-parser interaction. — BitTickler, May 16 '15 at 18:13
@BitTickler: I don't know enough about PGN to provide any sort of assistance. The alternative to parser/scanner architecture is the so-called scannerless model. Google will find you a zillion links, many of which consist of highly emotional arguments. (Personally, I'm not a fan, but whatever works.) Generally, scannerless parsers can be made to be unambiguous but they're rarely LR(1) and often not even LR(bounded), so in some contexts the terminology would be "not LR(k)" (i.e., still parseable with a GLR parser)... — rici, May 16 '15 at 18:42
@BitTickler: ... When you have trouble with the scanner/parser interface, it is often because what you have is not a simple language but rather a composite of languages (eg: regular expressions embedded in awk; the tokenization of regexes is not compatible with the tokenization of the surrounding language.). Language composition is another googleable term, and another instance where scannerless parsing is posited as a (or often "the") solution. The flex solution is start conditions, which is another composition technique. — rici, May 16 '15 at 18:45

score 0 · Answer 2 · answered May 16 '15 at 17:34

That grammar can be described as ambiguous.

Ignacio Vazquez-Abrams

Hm... given a grammar which does distinguish which token can occur (I did not bother to write an example in the question), the grammar would not be ambiguous, and the only reason this would not work would be that a scanner is used, in contrast to scanner-less parsing. Maybe I should edit the grammar to make it more obvious. – BitTickler May 16 '15 at 17:36
