Historically, LALR(1) parsers were preferred over LR(1) parsers because of the resource requirements imposed by the large number of states generated by LR(1) parsers. It's hard to believe that this continues to be an issue in today's computing environment. Is this still the case, or are modern compilers now built with canonical LR parsers, since LALR grammars are a proper subset of LR grammars?
-
Still LALR(1) as far as I know. There's really no reason to pay the extra space cost, which is an order of magnitude or so, considering the tiny differences in the grammar classes. – user207421 Jun 12 '14 at 00:55
-
Is the extra space cost an actual issue? True, the number of states is an order of magnitude greater, but the memory costs are puny compared to what they were in the past. Considering that it's actually easier to construct an LR algorithm than LALR, wouldn't the programmer's time be a stronger consideration in favor of LR? – tgoneil Jun 12 '14 at 01:11
-
The *relative* space cost is an issue. I was in class with Frank DeRemer (ahem) some *(many)* years ago and he was asked this question as futurology - what happens when memory becomes so big, etc. - and he said it still wouldn't be worth it. He also said that using huge LR tables would be bad for cache coherency, so it's a performance issue as well. – user207421 Jun 12 '14 at 03:31
-
Ok, in server environments supporting multiple clients that especially makes sense. Thanks. – tgoneil Jun 12 '14 at 12:11
-
Unless you're talking about an NFA regex-based parser, you absolutely don't have to worry about states. Any other parser that uses too many resources is either not handwritten or the format is naturally complex. – Leopold Asperger Jun 12 '14 at 12:14
-
@LeopoldAsperger You're confusing parsers with scanners, but they both have states, and parsers typically many more of them. Your remark about 'either not handwritten or the format is naturally complex' is basically meaningless, and in any case the question is about generated parsers. – user207421 Jun 12 '14 at 12:46
-
@ESP I'm not confusing anything. You may say that a scanner and a parser are always separate programs, but in my experience most real implementations have both characteristics at the same time, and that is how it should be to reach optimal performance. But I'm not going to enter a parser/lexer/scanner discussion. – Leopold Asperger Jun 12 '14 at 15:03
-
@EJP is trying to clarify the issue around generated states from parsers. There is no implication about anything handwritten. LR(1) parsers generate an order of magnitude more states than LALR(1) parsers. So yes, the number of states generated IS the central issue in the question. – tgoneil Jun 12 '14 at 17:12
-
@LeopoldAsperger When you talk about 'a NFA regex based parser' you are indeed confusing parsers with scanners. Scanners use regular expressions and NFAs and DFAs. Parsers use DPDAs. Putting irrelevant nonsense into my mouth about how 'a scanner and a parser are always separate programs' doesn't constitute rational argument. – user207421 Jun 13 '14 at 10:22
2 Answers
The main concern with LR(1) parsers is the table size, and that table size is going to hurt in one way or another.
If you have an LR(1) parser with 10,000,000 states (not all that uncommon) where there are, say, 50 nonterminals and 50 terminals (not all that unreasonable), you will have a table with one billion entries in it. If you use even one byte per entry, you now need 1GB of space just to hold the table. That space is either in the application binary, in which case you now have a 1GB executable, or it's generated dynamically, in which case you now need 1GB of RAM plus the time to populate it. Neither of these is very attractive.
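To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in Python using the purely illustrative figures above (10,000,000 states, 50 terminals plus 50 nonterminals, one byte per entry):

```python
# Back-of-the-envelope table size for a hypothetical canonical LR(1) parser,
# using the illustrative numbers from the paragraph above (not measurements).
states = 10_000_000
symbols = 50 + 50            # terminals (ACTION columns) + nonterminals (GOTO columns)
bytes_per_entry = 1

entries = states * symbols              # one ACTION/GOTO cell per (state, symbol) pair
table_bytes = entries * bytes_per_entry
print(f"{entries:,} entries -> {table_bytes / 10**9:.1f} GB")
# -> 1,000,000,000 entries -> 1.0 GB
```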
You absolutely could use an LR(1) parser if you have that kind of memory, but it wouldn't be a good idea. First, the size of the application binary would be enormous. This would make it difficult to distribute the application. Second, the act of loading the table into memory would require a transfer of about 1GB of data from disk into RAM, which would be extraordinarily slow. There's also the issue of paging in and out the parsing tables. If the OS doesn't do a good job evicting pages, you could end up thrashing, degrading performance unacceptably.
While you could put the parser on a server, this typically isn't done right now and would require that all compilation be done over a network.
There's also the question of whether it's worth it. The huge spike in resource costs from the parser would need to be justified by some proportional benefit in parsing quality. In practice, LALR parsers work for many grammars. For the grammars they don't handle, newer parsing algorithms like IELR or GLR would be a superior choice to canonical LR(1), because they offer the same parsing power (or more, in the case of GLR) with significant space reductions. Consequently, you'd be better off using those algorithms.
In summary, yes, you could use LR(1) today, but it would be so resource inefficient that you'd be better off with another parsing algorithm.
Hope this helps!

-
Yes, we already know there are lots and lots of states generated by LR(1) parsers. The point of the question is whether that's a practical issue today. Modern laptops and desktops have upwards of 8GB of memory, and they are really fast, so time and space shouldn't be an issue in a single-user environment. That's what prompted my question in the first place. Server environments, on the other hand, have to share resources with lots of users, making it more compelling to be efficient with resources and therefore to implement LALR instead. – tgoneil Jun 12 '14 at 12:21
-
@tgoneil Why would we be running compilers in server environments? – user207421 Jun 12 '14 at 12:48
-
@tgoneil I've updated my answer to address your question. Can you review it and decide whether this helps? – templatetypedef Jun 12 '14 at 17:12
-
@templatetypedef, thank you for the tip on alternative parsing algorithms like IELR and GLR that I can investigate. – tgoneil Jun 12 '14 at 17:21
Minimal LR(1) parsers solve this problem. Dr. Pager was the first to write a paper on how to do this, in 1977. Minimal LR(1) parsers have all the power of canonical LR(1) parsers, recognizing the same language defined by an LR(1) grammar. However, minimal LR(1) parsers have parser tables almost as small as LALR(1) parser tables.
The trick is to merge compatible states while building the canonical LR(1) state machine. This is complicated, and the lookahead-set computation is just as complicated as it is for LALR(1), but the end result is beautiful.
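As an illustration (not taken from any particular implementation), here is a rough Python sketch of the kind of weak-compatibility test Pager described, assuming a state is represented as a list of (item, lookahead_set) pairs and that the two states being compared already share the same core in the same order:

```python
# Sketch of Pager's weak-compatibility test (1977). Representation is
# illustrative: each state is a list of (item, lookahead_set) pairs, and the
# two states are assumed to have identical cores (same items, same order).

def weakly_compatible(state_a, state_b):
    """Return True if merging the two same-core states cannot introduce a
    reduce-reduce conflict that neither state already had on its own."""
    n = len(state_a)
    for i in range(n):
        for j in range(i + 1, n):
            la_a_i, la_a_j = state_a[i][1], state_a[j][1]
            la_b_i, la_b_j = state_b[i][1], state_b[j][1]
            # Merging is safe for this pair of items if the "crossed"
            # lookaheads are disjoint, or if a conflict already existed
            # within one of the original states.
            crossed_conflict = (la_a_i & la_b_j) or (la_a_j & la_b_i)
            already_conflicting = (la_a_i & la_a_j) or (la_b_i & la_b_j)
            if crossed_conflict and not already_conflicting:
                return False
    return True
```

During table construction, a newly built state is merged into an existing state with the same core whenever this test passes; otherwise it is kept as a separate state, which is why the resulting tables stay close to LALR(1) size without introducing the mysterious conflicts LALR(1) merging can cause.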
BTW, the LRSTAR Parser Generator creates minimal LR(1) and minimal LR(k) parsers, which are very powerful.