Suppose you have a language that allows a production like this: optional optional = 42, where the first "optional" is a keyword and the second "optional" is an identifier.

On the one hand, I'd like to have a Lex rule like optional { return OPTIONAL; }, which would later be used in YACC like this, for example:

optional : OPTIONAL identifier '=' expression ;

If I then define identifier as, say:

identifier : OPTIONAL | FIXED32 | FIXED64 | ... /* a couple dozen keywords */
    | IDENTIFIER ;

It just feels bad... besides, I would need two kinds of identifiers, one for when keywords are allowed as identifiers, and another one for when they aren't...

Is there an idiomatic way to solve this?

wvxvw

3 Answers


Is there an idiomatic way to solve this?

Other than the solution you have already found, no. Semi-reserved keywords are definitely not an expected use case for lex/yacc grammars.

The lemon parser generator has a %fallback declaration designed for cases like this, but as far as I know, that useful feature has never been added to bison.

You can use a GLR grammar to avoid having to figure out all the different subsets of identifier. But of course there is a performance penalty.

rici
  • Sir, I have a question. In case of lexing (custom, without regex engine). How do you lex an identifier like 'RETURNA'? If I run the lexer it will detect 'RETURN' as a keyword and the 'A' at the end as an identifier. It splits that word into keyword and identifier, however, I'm sure that it is an identifier. The order is ... -> Keywords -> Identifiers -> .. –  Oct 27 '21 at 15:50
  • @DickWilliams: Then you need to fix the lexer. (Or use a lexer generator, which is more efficient both for you and for your executable.) If you want to hand-build an efficient lexer, build a trie (which is effectively what the generator would do for you, saving you the trouble of making sure you covered all the cases). If you don't want to do that, then every keyword pattern needs to check that the first unmatched character isn't a valid identifier character (letters, digits, `_`, whatever else you allow.) – rici Oct 27 '21 at 16:03
  • Thank you sir! I'm building my own lexer because I want to learn how it works. Of course using existing tools is great too! –  Oct 27 '21 at 16:08
  • @DickWilliams: You won't learn how lexers work by building your own, because building a table-driven state machine (the most efficient way to write a lexical analyser) is far too difficult and error-prone to do by hand. You're much better off learning how to *use* lexers, IMHO. The analogy I usually use is: would you learn anything about trigonometry by insisting on writing your own `sin` and `cos` functions using a Taylor expansion? It might be an interesting mathematical exercise but it won't help you write better graphics. – rici Oct 27 '21 at 16:12
  • The key to writing non-bloated software is crafting a good API, discarding features which are no longer needed (or were never necessary). Writing good interfaces is probably one of the most difficult design tasks; it requires a delicate balance between satisfying use cases (other than your own) and avoiding overengineering. The first step, always, is to try out different APIs, with a wide variety of different use cases. Also, don't immediately reject existing APIs; try to figure out the motivations and whether or not they worked. Implementation is the *last* thing to do, not the first. – rici Oct 27 '21 at 16:24

You've already discovered the most common way of dealing with this in lex/yacc, and, while not pretty, it's not too bad. Normally you call the rule that matches an identifier or a set of keywords something like whateverName, and you may have more than one such rule, since different contexts may accept different sets of keywords as a name.

Another way that may work, if your keywords are only recognized as such in easily identifiable places (such as at the start of a line), is to use a lex start state so that the lexer only returns a KEYWORD token when the keyword appears in that context. In any other context, the keyword is simply returned as an identifier token. You can even use yacc actions to set the lexer state for somewhat complex contexts, but then you need to be aware of the one-token lookahead the parser may request from the lexer: an action might not run until after the token following it has already been read.
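
As a rough illustration of the start-state idea (not part of the original answer, and written as hand-rolled C rather than lex input), the sketch below keeps a mode flag that plays the role of a lex start state. The names `lexer_mode`, `classify_word`, and `end_statement` are hypothetical, and the sketch assumes the keyword is only meaningful as the first word of a statement:

```c
#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum word_kind { WORD_KEYWORD_OPTIONAL, WORD_IDENTIFIER };

/* Hypothetical lexer modes, standing in for lex start states. */
enum lexer_mode { MODE_STATEMENT_START, MODE_INSIDE_STATEMENT };

struct lexer {
    enum lexer_mode mode;
};

/* Classify one already-scanned word. "optional" is a keyword only when it
   is the first word of a statement; anywhere else it is an identifier. */
enum word_kind classify_word(struct lexer *lx, const char *word) {
    enum word_kind kind = WORD_IDENTIFIER;
    if (lx->mode == MODE_STATEMENT_START && strcmp(word, "optional") == 0) {
        kind = WORD_KEYWORD_OPTIONAL;
    }
    /* After the first word, keywords are no longer recognized. */
    lx->mode = MODE_INSIDE_STATEMENT;
    return kind;
}

/* Called when the statement terminator is seen (assumed to exist). */
void end_statement(struct lexer *lx) {
    lx->mode = MODE_STATEMENT_START;
}
```

With the lexer starting in `MODE_STATEMENT_START`, the input `optional optional = 42` yields a keyword for the first word and an identifier for the second. In real lex the same effect would come from start conditions, subject to the lookahead caveat above.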

Chris Dodd
  • Sir, I have a question. In case of lexing (custom, without regex engine). How do you lex an identifier like `returna`? If I run the lexer it will detect 'return' as a keyword and the 'a' at the end as an identifier. It splits that word into keyword and identifier, however, I'm sure that it is an identifier. The order is ... -> Keywords -> Identifiers -> ... –  Oct 27 '21 at 15:46
  • @DickWilliams: lex will always match the *longest* lexeme that matches a pattern -- the order only matters when two patterns match the same length. So if you have a pattern that matches `return` and a (later) pattern that matches `returna`, the longer match will be matched. – Chris Dodd Oct 27 '21 at 16:08
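
For a hand-written lexer, one way to get the same longest-match behaviour is to consume the whole run of identifier characters first and only then compare the complete word against the keyword list. The following is a minimal sketch under that assumption; `scan_word` and the token names are hypothetical, and only the single keyword `return` is handled:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum token_kind { TOK_RETURN, TOK_IDENTIFIER, TOK_NONE };

static int is_ident_char(char c) {
    return isalnum((unsigned char)c) || c == '_';
}

/* Scan one word at the start of src and store its length in *len.
   The full run of identifier characters is consumed before the keyword
   check, so "returna" can never be split into "return" + "a". */
enum token_kind scan_word(const char *src, size_t *len) {
    size_t n = 0;

    /* A word must start with a letter or underscore. */
    if (!(isalpha((unsigned char)src[0]) || src[0] == '_')) {
        *len = 0;
        return TOK_NONE;
    }
    while (is_ident_char(src[n])) {
        n++;
    }
    *len = n;

    /* Compare the complete word, not just a prefix, against the keyword. */
    if (n == strlen("return") && strncmp(src, "return", n) == 0) {
        return TOK_RETURN;
    }
    return TOK_IDENTIFIER;
}
```

Here `scan_word("returna = 1", &len)` classifies the word as an identifier of length 7, while `scan_word("return a", &len)` reports the keyword with length 6.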

This is a case where the keywords are not reserved. A few programming languages have allowed this, for example PL/I and FORTRAN. It's not a lexer problem, because the lexer should always know which IDENTIFIERs are keywords. It's a parser problem. Unreserved keywords usually introduce a lot of ambiguity into the language specification, and parsing becomes a nightmare. The grammar would have this:

identifier : keyword | IDENTIFIER ;

keyword : OPTIONAL | FIXED32 | FIXED64 | ... ;

If you have no conflicts in the grammar, then you are OK. If you have conflicts, then you need a parser generator that supports a more powerful algorithm, such as LR(k) or GLR.