Keywords and identifiers conflict when writing a lexer (Scala libs)

Question

I've tried fastparse, parboiled2 and scala-combinators. They all have this problem when defining a LEXER:

LET_KEYWORD ::= "let"
IDENTIFIER  ::= "[a-zA-Z]+".r

When I run them against input "leto" they produce [LET_KEYWORD,IDENTIFIER(o)].

I'd expect some of those libraries to give me a behaviour like this:

If the input is "let", it resolves the ambiguity by choosing the first defined rule, because it's the more relevant one. If the input is "leto", there is no ambiguity and it produces only IDENTIFIER(leto). That's the behaviour described here, in the ANTLR

score 1 · Answer 1 · answered Dec 31 '18 at 10:19

Here is a snippet from my code:

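// Lex a whole word first, then decide whether it is a keyword.
val identifierOrKeyword = letter ~ rep(letter | digit | '_') ^^ {
  case x ~ xs =>
    val ident = (x :: xs).mkString
    // Keywords are looked up case-insensitively; anything else is an identifier.
    keyword.getOrElse(ident.toLowerCase, IDENTIFIER(ident))
}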

keyword is a map from string to token.

Used definitions:

sealed trait SqlToken
object SqlToken {
  case class IDENTIFIER(value: String) extends SqlToken
  case object LET extends SqlToken
}

import SqlToken._

val keyword: Map[String, SqlToken] = Map(
  "let" -> LET
)
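
For anyone who wants to try the idea without a parsing library, here is a minimal standalone sketch of the same keyword-table approach using only scala.util.matching.Regex. The object name KeywordTableLexer and the tokenize helper are made up for illustration, and it skips anything that is not a word, which a real lexer would not do.

```scala
import scala.util.matching.Regex

object KeywordTableLexer {
  sealed trait SqlToken
  case class IDENTIFIER(value: String) extends SqlToken
  case object LET extends SqlToken

  // Keyword table: lowercase spelling -> token.
  val keyword: Map[String, SqlToken] = Map("let" -> LET)

  // A single rule covers keywords and identifiers:
  // a letter followed by letters, digits or '_'.
  private val word: Regex = "[a-zA-Z][a-zA-Z0-9_]*".r

  // Hypothetical helper: finds every word in the input and classifies it.
  // Non-word characters (whitespace, punctuation) are simply skipped.
  def tokenize(input: String): List[SqlToken] =
    word.findAllIn(input).toList.map { w =>
      keyword.getOrElse(w.toLowerCase, IDENTIFIER(w))
    }
}
```

Because the whole word is consumed before the lookup, "leto" can never be split into a keyword plus a leftover identifier.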

score 1 · Answer 2 · answered Dec 31 '18 at 11:18


Your situation is not comparable to the ANTLR situation in which the lexer is staged before the parser. In that situation you see that the longest match rule of the lexer takes precedence simply because it is executed first, producing the only token that the parser can then consume.

In your case, the parsing technologies you used execute the regular expressions "on demand", in the context of the current non-terminal you are trying to recognize. This makes the choice between the two different lexical interpretations bubble up to a context-free choice. You have to wire that choice into your definitions.

I'd guess that the order of the rules in the source code is not relevant for these technologies; you'd have to use a declarative ordered choice somewhere (not the |), or rewrite the grammar so that it is no longer ambiguous.
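
One way to make the grammar unambiguous is to let the keyword rule match only when no further identifier character follows. The sketch below shows that with a negative lookahead in plain scala.util.matching.Regex rather than in any of the libraries from the question; the names GuardedKeywordLexer and lexOne are invented for the example, and the token names are taken from the question.

```scala
import scala.util.matching.Regex

object GuardedKeywordLexer {
  sealed trait Token
  case object LET_KEYWORD extends Token
  case class IDENTIFIER(value: String) extends Token

  // "let" only counts as a keyword when no further letter follows it.
  private val letKeyword: Regex = "let(?![a-zA-Z])".r
  private val identifier: Regex = "[a-zA-Z]+".r

  // Hypothetical helper: lexes one token at the start of `input` and
  // returns it together with the number of characters consumed.
  // The keyword rule is tried first; the identifier rule is the fallback.
  def lexOne(input: String): Option[(Token, Int)] =
    letKeyword.findPrefixOf(input).map(m => (LET_KEYWORD: Token, m.length))
      .orElse(identifier.findPrefixOf(input).map(m => (IDENTIFIER(m): Token, m.length)))
}
```

With the guard in place the two rules can no longer both succeed on "leto", so the order in which they are tried only matters for the exact input "let".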

answered Dec 31 '18 at 11:18

Jurgen Vinju

I'm actually staging the lexer before the parser, letting it produce a sequence of tokens, regardless of its order. But if I get "leto", I want it to produce only one Identifier token. – caeus Dec 31 '18 at 12:42
Then could you force longest-match (maximal munch) behavior from the lexer? Happy New Year, BTW – Jurgen Vinju Jan 01 '19 at 08:22
Some lexers always prefer keyword literals over identifiers, some take the order of the rules as priority declaration. – Jurgen Vinju Jan 01 '19 at 08:33
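
The longest-match behaviour with rule order as the tie-breaker, as discussed in the comments above, could be sketched roughly like this. It is a hand-rolled illustration on top of scala.util.matching.Regex, not how any of the mentioned libraries work internally; MaximalMunchLexer, rules and tokenize are hypothetical names, and whitespace skipping is an added assumption.

```scala
import scala.annotation.tailrec
import scala.util.matching.Regex

object MaximalMunchLexer {
  sealed trait Token
  case object LET_KEYWORD extends Token
  case class IDENTIFIER(value: String) extends Token

  // Rules in priority order: an earlier rule wins when match lengths are equal.
  private val rules: List[(Regex, String => Token)] = List(
    ("let".r, (_: String) => LET_KEYWORD),
    ("[a-zA-Z]+".r, (s: String) => IDENTIFIER(s))
  )
  private val whitespace: Regex = "\\s+".r

  def tokenize(input: String): List[Token] = {
    @tailrec
    def loop(rest: String, acc: List[Token]): List[Token] =
      if (rest.isEmpty) acc.reverse
      else whitespace.findPrefixOf(rest) match {
        case Some(ws) => loop(rest.drop(ws.length), acc)
        case None =>
          // Try every rule at the current position.
          rules.flatMap { case (re, build) => re.findPrefixOf(rest).map(m => (m, build)) } match {
            case Nil => sys.error(s"unexpected input: $rest")
            case candidates =>
              // Keep the earlier candidate unless a later one is strictly longer.
              val (text, build) =
                candidates.reduceLeft((a, b) => if (b._1.length > a._1.length) b else a)
              loop(rest.drop(text.length), build(text) :: acc)
          }
      }
    loop(input, Nil)
  }
}
```

On "let" both rules match three characters and the first rule wins; on "leto" the identifier rule matches four characters and wins on length.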
