4

I wish to understand how a parser works. I have learnt about LL, LR(0) and LR(1) parsing, and how to build NFAs, DFAs, parse tables, etc.

Now the problem is this: I know that in some situations a lexer should extract tokens only on the parser's demand, when it is not possible to extract all the tokens in one separate pass. I don't exactly understand what kind of situation that is, so I'm open to any explanation about it.

The question now is: how should a lexer do its job? Should it base its recognition on the current "context", i.e. the non-terminals currently expected by the parser? Or is it something totally different? What about GLR parsing: is that another case where the lexer could try different terminals, or is it a purely syntactic matter? I would also like to understand what this depends on; for example, is it related to the kind of parsing technique (LL, LR, etc.) or only to the grammar?

Thanks a lot

Ira Baxter
unomadh

2 Answers

6

The simple answer is that lexeme extraction has to be done in context. What one might consider to be lexemes in the language may vary considerably in different parts of the language. For example, in COBOL, the data declaration section has 'PIC' strings and location-sensitive level numbers 01-99 that do not appear in the procedure section.

The lexer thus has to somehow know what part of the language is being processed, in order to know what lexemes to collect. This is often handled by having lexing states, each of which processes some subset of the entire language's set of lexemes (often with considerable overlap between the subsets; e.g., identifiers tend to be pretty similar in my experience). These states form a high-level finite state machine, with transitions between them when phase-changing lexemes are encountered, e.g., the keywords that indicate entry into the data declaration or procedure section of a COBOL program. Modern languages like Java and C# minimize the need for this, but most other languages I've encountered really need this kind of help in the lexer.
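
Here is a minimal sketch, in Python rather than any particular lexer generator, of what such a state-based lexer might look like; the state names, token names and regular expressions are invented for illustration, not taken from a real COBOL front end:

```python
import re

# Illustrative rule sets: each lexing state recognises a different subset of
# the language's lexemes (with overlap, e.g. identifiers appear in both).
RULES = {
    "DATA": [
        ("LEVEL_NUMBER", re.compile(r"\d\d")),
        ("PIC_CLAUSE",   re.compile(r"PIC\s+\S+")),
        ("IDENTIFIER",   re.compile(r"[A-Za-z][A-Za-z0-9-]*")),
    ],
    "PROCEDURE": [
        ("VERB",         re.compile(r"MOVE|ADD|PERFORM")),
        ("IDENTIFIER",   re.compile(r"[A-Za-z][A-Za-z0-9-]*")),
    ],
}

# Phase-changing lexemes drive the high-level finite state machine.
TRANSITIONS = {
    "DATA DIVISION.":      "DATA",
    "PROCEDURE DIVISION.": "PROCEDURE",
}

def tokenize(text, state="DATA"):
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for marker, next_state in TRANSITIONS.items():
            if text.startswith(marker, pos):
                state = next_state            # switch lexing state
                pos += len(marker)
                break
        else:
            for name, regex in RULES[state]:
                m = regex.match(text, pos)
                if m:
                    yield (name, m.group(), state)
                    pos = m.end()
                    break
            else:
                raise SyntaxError(f"cannot lex at position {pos} in state {state}")

# The same kind of characters are tokenised differently depending on the state.
print(list(tokenize("01 TOTAL PIC 9(5). PROCEDURE DIVISION. MOVE TOTAL")))
```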

So-called "scannerless" parsers (you are thinking "GLR") work by getting rid of the lexer entirely; now there's no need for the lexer to produce lexemes, and no need to track lexical states :-} Such parsers work by simply writing the grammar down the level of individual characters; typically you find grammar rules that are the exact equivalent of what you'd write for a lexeme description. The question is then, why doesn't such a parser get confused as to which "lexeme" to produce? This is where the GLR part is useful. GLR parsers are happy to process many possible interpretations of the input ("locally ambiguous parses") as long as the choice gets eventually resolved. So what really happens in the case of "ambiguous tokens" is the the grammar rules for both "tokens" produce nonterminals for their respectives "lexemes", and the GLR parser continues to parse until one of the parsing paths dies out or the parser terminates with an ambiguous parse.
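
To make "grammar rules down to the level of individual characters" concrete, here is a sketch of what character-level productions for an identifier might look like; the notation (a Python dict of alternatives) is purely illustrative and is not the input format of any particular GLR tool:

```python
# Each nonterminal maps to a list of alternatives; each alternative is a list
# of symbols. Single characters play the role that tokens normally would.
character_level_grammar = {
    # identifier      : letter identifier_rest
    "identifier":      [["letter", "identifier_rest"]],
    # identifier_rest : letter identifier_rest | digit identifier_rest | (empty)
    "identifier_rest": [["letter", "identifier_rest"],
                        ["digit", "identifier_rest"],
                        []],
    # letter and digit expand directly to individual characters
    "letter":          [[c] for c in "abcdefghijklmnopqrstuvwxyz"],
    "digit":           [[c] for c in "0123456789"],
    # A keyword is also just a sequence of characters, so the input "if..."
    # is locally ambiguous between if_keyword and identifier; a GLR parser
    # carries both interpretations until later input kills one of them.
    "if_keyword":      [["i", "f"]],
}
```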

My company builds lots of parsers for languages. We use GLR parsers because they are very nice for handling complex languages; write the context-free grammar and you have a parser. We use lexical-state-based lexeme extractors with the usual regular-expression specification of lexemes and lexical-state transitions triggered by certain lexemes. We could arguably build scannerless GLR parsers (by making our lexers produce single characters as tokens :) but we find the efficiency of the state-based lexers to be worth the extra trouble.

As practical extensions, our lexers actually use push-down-stack automata for the high-level state machine rather than mere finite state machines. This helps when one has a high-level FSA whose substates are identical, and where it is helpful for the lexer to manage nested structures (e.g., matching parentheses) in order to manage a mode switch (e.g., switching back once the parentheses have all been matched).
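
A minimal sketch, with invented names, of the kind of pushdown (stack-of-modes) lexer state being described; a real implementation would of course do much more:

```python
class ModeStackLexer:
    """Lexer whose 'state' is a stack of modes, so nested regions can restore
    the enclosing mode when they close."""

    def __init__(self, initial_mode):
        self.modes = [initial_mode]          # pushdown stack of lexing modes

    @property
    def mode(self):
        return self.modes[-1]                # current mode = top of stack

    def on_lexeme(self, lexeme):
        # Entering a parenthesised region pushes a mode; the matching close
        # parenthesis pops it, switching back only when all parentheses match.
        if lexeme == "(":
            self.modes.append("IN_PARENS")
        elif lexeme == ")":
            self.modes.pop()

lexer = ModeStackLexer("DEFAULT")
for lexeme in ["(", "(", ")", ")"]:
    lexer.on_lexeme(lexeme)
print(lexer.mode)   # "DEFAULT": the mode switch happens only at the outermost ")"
```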

A unique feature of our lexers is that we also do a tiny bit of what scannerless parsers do: sometimes, when a keyword is recognized, our lexers will inject both a keyword and an identifier into the parser (this simulates a scannerless parser with a grammar rule for each). The parser will of course only accept what it wants "in context" and simply throw away the wrong alternative. This gives us an easy way to handle "keywords in context, otherwise interpreted as identifiers", which occur in many, many languages.
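
A sketch of the dual-token trick just described, with made-up token kinds and word list (this is not the answerer's actual code):

```python
# Words that are keywords only in certain contexts; everywhere else they are
# ordinary identifiers. The list is invented for illustration.
CONTEXTUAL_KEYWORDS = {"format", "value", "yield"}

def lex_word(word):
    """Tokens handed to a GLR-style parser for a single word."""
    if word.lower() in CONTEXTUAL_KEYWORDS:
        # Inject both interpretations; the parser keeps whichever fits the
        # surrounding context and the other parse path simply dies out.
        return [("KEYWORD_" + word.upper(), word), ("IDENTIFIER", word)]
    return [("IDENTIFIER", word)]

print(lex_word("format"))   # both a keyword token and an identifier token
print(lex_word("counter"))  # just an identifier token
```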

Ira Baxter
  • When you're talking about a scannerless parser you say that they essentially work by doing a search on multiple productions to see which one is valid. Wouldn't a scannerless parser also have the benefit of 'knowing' what rule it's in which would resolve a lot of that ambiguity? I.e. if the parser is parsing an assignment then it knows it's looking for an identifier and not a keyword, right? – doliver Aug 28 '14 at 16:03
  • The trouble with a scannerless parser is that, even if it "knows" it is processing an assignment, the notion of "keyword" vs "identifier" is "just more rules". So which ones can it inspect to decide? *If* it can decide which grammar rules (keyword vs. identifier) are relevant, then in fact you can make a conventional scanner that looks only for keywords, or only for identifiers, and so you don't need a scannerless parser there. The place where a scannerless parser is of real help is where it can't know until it reads a lot more input whether something is an identifier or not. ... – Ira Baxter Aug 28 '14 at 16:31
  • .... Here's an example from Fortran. We all know that "I1" is an identifier, right? (You're walking into a trap here.) Now, let's examine the Fortran code "17 FORMAT(X2,I1)". Clearly X2 and I1 are identifiers, right, and FORMAT is a keyword, right? OK, now let's extend the code: "17 FORMAT(X2,I1)=expression". OK, yep, there X2 and I1 really are identifiers, and FORMAT is an identifier too (the name of an array being assigned). But what about the bare statement "17 FORMAT(X2,I1)"? Oops, there they aren't identifiers, and FORMAT is a keyword. A scannerful parser has a very hard time with this (and most of them cheat to handle it). ... – Ira Baxter Aug 28 '14 at 16:36
  • ... A scannerless parser simply parses "FORMAT" as *both* a keyword and an identifier (by virtue of its character-level grammar rules), and treats I1 and X2 the same way. Eventually it encounters the differentiating text at the end of the statement, and rejects the parses that don't work. You have to agree that's fairly pretty. So why don't *we* do this (I have FORTRAN parsers)? The answer is that our scannerful scanners can, when they see a specific, peculiar identifier-like thing (especially ambiguous ones like FORMAT and I1), inject both a keyword and an identifier... – Ira Baxter Aug 28 '14 at 16:39
  • ... and our GLR-parser-operating-on-tokens can then resolve things just fine, the same way the scannerless one did. The difference is that our scanners are efficient the same way conventional lexical scanners are (because they are in effect conventional), and do efficient token recognition without dragging the parser into the process. The scannerless one succeeds, too, but now you have the GLR parser (handling multiple parallel valid parse prefixes) doing a lot of work *per character*, all the time --> slow. "Doctor, doctor, it hurts when I do X". OK, "Don't do X". – Ira Baxter Aug 28 '14 at 16:43
  • "If in can decide which grammar rules (keyword vs. identifier) are relevant, than in fact you can make a conventional scanner that looks only for keywords, or only for identifiers and so you don't need a scannerless parser there." You mean make a conventional scanner and look only for keywords or identifiers based on a lexing state? What I'm talking about is if you having something like `->function ()` then after reading `"function "` it starts reading ``. The identifier could be a keyword but it knows it read function so it knows to invoke – doliver Aug 28 '14 at 17:45
  • ... `consume_identifier()` so it doesn't have to worry about trying to figure out if the next token is a keyword. As I understand it, if you had a scanner then the alternative would be to grab the state from the parser to get the same information. Is that correct? – doliver Aug 28 '14 at 17:46
  • So a lexer (including ours) can keep track of the apparent state of the parser by a) simulating where the parser is abstractly, or b) *asking* the parser if it needs to know. Given that information, it can decide what lexing state to "be" in. We have the additional flexibility of inserting multiple tokens as the "next" token, which allows us to avoid building lexical states in many cases. In the FORTRAN case, we have some lexical states that, when "FORMAT" is seen as a string of characters, generate both a FORMAT keyword token *and* an Identifier token (with the identifier name being "FORMAT"). – Ira Baxter Aug 28 '14 at 17:49
  • ... we use all of those tricks. (The "track the parser state" part can be done many ways, one of which is what you are doing in your example: the lexer sees the "function" keyword and transits to a state in which it is only willing to look for an identifier.) – Ira Baxter Aug 28 '14 at 17:57
  • I think I understand what you're saying except for this, "The scannerless one succeeds, too, but now you have the GLR parser (handling multiple parallel value parse prefixes) doing a lot of work per character". It doesn't seem like it would have to dig through the parallel prefixes most of the time because the parser can track state in the same way that the lexer would to know what kind of tokens are appropriate. In particular if the lexing state is triggered by the current rule then it should happen automatically, right? – doliver Aug 28 '14 at 17:57
  • The point is that the GLR parser goes through the process of managing its pseudo-parallel "parsers" for each input element. The question is, do you want to pay that price for each input *character*, or each input *token*? If you believe that tokens average N characters in length, then the character-processing version is paying N times the cost to do its job. Even N=1.3 makes a significant difference. It's worse than that: scannerless parsers have to process whitespace and comments, which make up a significant bulk of the input stream in practice, and these tend to be long --> big N. – Ira Baxter Aug 28 '14 at 18:02
  • From wikipedia "The time required to run the algorithm is proportional to the degree of nondeterminism in the grammar: on deterministic grammars the GLR algorithm runs in O(n) time" is what I'm confused about. In other words, parallel processing only actually happens when the input is non deterministic and the individual characters never will be, right? – doliver Aug 28 '14 at 18:09
  • Well, GLR does its pseudo-parallel processing with just *one* pseudo-thread when the grammar (or the part it is processing) is deterministic. The difference when it has more than one choice is that it simply tracks N pseudo-parallel branches (which may go up or down depending on how many possibilities the grammar offers as tokens are shifted). So it is the *same* GLR code that processes N as processes 1 (you can build faster GLR parsers that distinguish these cases, and that's generally worth the trouble). ... – Ira Baxter Aug 28 '14 at 20:36
  • 2
  • ... In any case, you are throwing the heavyweight GLR machinery at every character, even in deterministic processing, as opposed to throwing the lexing machinery at each character. The latter can be incredibly simple: in lexical state X, use the next character as an index to jump through a table to an action (typically "store character in buffer") that jumps to the next state. We're talking a few machine instructions for the lexical part, hundreds for the GLR loop. It's all O(N) deterministically, but O notation means "ignoring the constant factors". This whole discussion is about the constant. – Ira Baxter Aug 28 '14 at 20:39
  • Okay, so basically there's a large amount of overhead generated in a GLR parser (in the deterministic case) from bypassing all of the unnecessary capabilities of the GLR in order to do what the scanner was going to do anyway, correct? I'd be interested to see what that extra overhead looks like (extra memory allocations, function calls, conditionals, and forks?) but I think that would be a question best resolved by me going and looking at the internals of a GLR parser. Thanks for your help/patience on this two-year-old question. – doliver Aug 28 '14 at 22:25
  • Right. The *one* place where scannerless parsing makes sense is where sequences of characters, depending on the choice of parse, don't break up into tokens of the same size. Then you can't have a single lexer that produces a single stream of lexemes. In this case, I think you have to fall back to character-by-character parsing. The ideal answer IMHO is one in which you do lexer-like scanning where it works, and character-by-character scanning only where forced to do it. In 20 years of building front ends with a GLR parser/traditional lexer, we've managed to wriggle out of this every time. – Ira Baxter Aug 28 '14 at 22:48
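
The constant-factor argument in the comments above can be put into a back-of-envelope model; every number below is an illustrative assumption, not a measurement:

```python
# Both approaches are O(N); the difference is which machinery runs per character.
tokens = 100_000
avg_token_len = 6       # assumed characters per token, incl. whitespace/comments
glr_step = 300          # assumed cost (in "instructions") of one GLR parse step
lexer_step = 10         # assumed cost of one table-driven lexer character step

chars = tokens * avg_token_len
scannerless = chars * glr_step                        # GLR step on every character
token_based = chars * lexer_step + tokens * glr_step  # lexer per char, GLR per token

print(scannerless / token_based)   # 5.0 with these made-up numbers
```
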
4

Ideally, the tokens themselves should be unambiguous; you should always be able to tokenise an input stream without the parser doing any additional work.

This isn't always so simple, so you have some tools to help you out:

  1. Start conditions

    A lexer action can change the scanner's start condition, meaning it can activate different sets of rules.

    A typical example of this is string literal lexing; when you lex a string literal, the rules for tokenising usually become completely different from those of the language containing it. This is an example of an exclusive start condition.

    You can separate ambiguous lexings if you can identify two separate start conditions for them and ensure the lexer enters them appropriately, given some preceding context (there is a small sketch of this after the list).

  2. Lexical tie-ins

    This is a fancy name for carrying state in the lexer, and modifying it in the parser. If a certain action in your parser gets executed, it modifies some state in the lexer, which results in lexer actions returning different tokens. This should be avoided where possible, because it makes your lexer and parser both more difficult to reason about, and makes some things (like GLR parsers) impossible.

    The upside is that you can do things that would require significant grammar changes with relatively minor impact on the code; you can use information from the parse to influence the behaviour of the lexer, which in turn can go some way toward solving what you see as an "ambiguous" grammar (also sketched after the list).

  3. Logic, reasoning

    It's probable that your input can be tokenised in a single pass, and the above tools should come second to thinking about how you should be tokenising the input and how to convert that into the language of lexical analysis. :)

    The fact is, your input is comprised of tokens—whether you like it or not!—and all you need to do is find a way to make a program understand the rules you already know.
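
A minimal hand-written sketch of the start-condition idea from item 1, using string literals; the state and token names are invented:

```python
def tokenize(text):
    state = "INITIAL"
    buf = []
    for ch in text:
        if state == "INITIAL":
            if ch == '"':
                state = "IN_STRING"     # exclusive start condition: from here
                buf = []                # on, only the string rules apply
            elif ch.isspace():
                continue
            else:
                yield ("CHAR", ch)      # stand-in for the normal token rules
        elif state == "IN_STRING":
            if ch == '"':
                yield ("STRING", "".join(buf))
                state = "INITIAL"       # leave the start condition
            else:
                buf.append(ch)          # inside a string, the surrounding
                                        # language's tokens are not recognised

print(list(tokenize('x = "a + b"')))    # the '+' inside the quotes is not a token
```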
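
And a sketch of a lexical tie-in as in item 2, loosely modelled on the classic C typedef problem; all names are invented, and the shared mutable state is exactly what makes this technique hard to reason about:

```python
class Lexer:
    def __init__(self):
        self.typedef_names = set()      # state owned by the lexer...

    def classify(self, word):
        # ...and consulted when deciding which token to return.
        if word in self.typedef_names:
            return ("TYPE_NAME", word)
        return ("IDENTIFIER", word)

class Parser:
    def __init__(self, lexer):
        self.lexer = lexer

    def on_typedef_declared(self, name):
        # A parser action reaches back into the lexer and modifies its state,
        # so the same spelling lexes differently afterwards.
        self.lexer.typedef_names.add(name)

lexer = Lexer()
parser = Parser(lexer)
print(lexer.classify("size_t"))         # ('IDENTIFIER', 'size_t')
parser.on_typedef_declared("size_t")
print(lexer.classify("size_t"))         # ('TYPE_NAME', 'size_t')
```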

Asherah
  • I've found that start conditions are sometimes implemented via lexing modes: https://github.com/SAP/Chevrotain/blob/master/examples/lexer/multi_mode_lexer/multi_mode_lexer.js Certain tokens, when lexed, push the lexer into a different state/mode, and then certain other tokens exit the state/mode. – CMCDragonkai Nov 13 '17 at 13:26