How to handle x*, x+, or x? regex-like operators in an LR parser?

Question

I have implemented recursive descent and PEG-like parsers in the past, where you could do things like this:

Path -> Segment+
Segment -> Slash Name
Segment -> /
Name -> /\w+/
Slash -> /

where Segment+ means "match one or more Segment"
and there's a plain old regular expression for matching one or more word characters with \w+

How do you typically accomplish this same sort of thing with LR grammars/parsers? All of the examples of LR parsers I have seen are very basic, such as parsing 1 + 2 * 3, or (())(), where the patterns are very simple and don't seem to involve "one or more" functionality (or zero or more with *, or optional with ?). How do you do that in an LR parser generally?

Or does LR parsing require a lexing phase first (i.e. an LR parser requires terminal and nonterminal "tokens"). Hoping that there is a way to do LR parsing without two phases like that. The definition of an LR parser talks about "input characters" in the books/sites I've been reading, but then you see casually/subtly a line like:

The grammar's terminal symbols are the multi-character symbols or 'tokens' found in the input stream by a lexical scanner.

And it's like what, where did the scanner come from.

score 2 · Answer 1 · answered Apr 26 '15 at 06:19

You can certainly write a scannerless grammar for a language, but in most cases it won't be LR(1), because 1 token of lookahead isn't much when the token is a single character.

Generally, LALR(1) parser generators (like bison) are used in conjunction with a scanner generator (like flex).

Ira Baxter · Answer 2 · 2015-06-19T14:41:30.590

Whenever you have a parser operating on a stream of tokens, there is always the question of what produced the stream of tokens. With most parser generators, the grammar specification and the lexical specification of tokens are kept separate, mostly because the way the parser generator and lexer generator operate are different.

Adding regex operators to "the grammar" is convenient. But they do not extend the power of context free grammars.

You have 3 choices for using regex-like operators in grammars.

1) Use them at the character level consistently across the grammar. If you do this, your parser operates with tokens being individual characters. This is likely to work badly with most parser generators, because the decision for most of them is based on the next input stream token only, in this case, a single character. To make this work you either need a backtracking parser or one that will try multiple paths through the space of alternative parses; LR and LL parsers don't do this. There are scannerless GLR parsers for which this would work, if you don't mind the additional overhead of GLR on a per character basis. (Also, if operating at the character level, you are likely to have explicitly specify whitespace and comment syntax).

2) Use them as specifications of individual token character sequences (as in OP's "Name -> /w+/"). In this form, what you end up doing is writing a lexical token specifications integrated with the grammar. The grammar can be then processed into two parts: the lexical specification, and a more conventional BNF. This idea is compatible with many lexer and parser generators; it still doesn't change the expressive power.

3) Use the regex operators only on grammar elements. These are are easily transformed into conventional BNF grammar rules:

  Path -> Segment +

is equivalent to:

  Path -> Segment 
  Path -> Path Segment

After such transformations the results are compatible with most parser generators. This leaves open how the lexical syntax of grammar tokens are specified.

You can implement a hybrid scheme combining 1) and 2), which is appears to be what OP has done.

score 1 · Answer 3 · 2019-05-21T15:11:46.187

Regular expressions are allowed in the BNF grammars of some LR parser generators. LRSTAR 9.1 allows the use of regular expressions. Yacc and Bison do not allow regular expressions in BNF grammars.

Lexical tokens should be specified in a lexical grammar, separate form the BNF (parser) grammar. Having the tokens already defined before the parser sees them makes life a lot easier for the parser.

PEG's don't force you to separate the lexical symbols from the grammar symbols, thereby creating some serious problems, such as requiring infinite lookahead in some cases.

Chuckov · Answer 4 · 2022-08-02T15:44:39.540

There's a fundamental difference between regular expressions and context free grammars. Regex can be freely composed: you can write *, +, ? in any combination, such as (a*a+)*, and we know we can create a deterministic state machine for it in the end. But not so with grammars. It's not hard to create a "regular grammar" to recognize a*, for example: Astar --> a Astar | null. But you can't compose them: S --> Astar Astar is hopelessly ambiguous: there are n+1 distinct parse trees for an input of a^n.

Of course, you could sweep the fundamental problem underneath the rug and hack a parser out of it, but that approach is like running a program that should not run (for x in A: A.append(x)). An ambiguous grammar should not result in a deterministic parser. You can add the ability to recognize *, + and ? to any LR parser generator in a naive way but using them may lead to even more ambiguities (shift-reduce and reduce-reduce conflicts).

How to handle x*, x+, or x? regex-like operators in an LR parser?

4 Answers4