9

I am working on a new programming language, and I have always been puzzled by the fact that everyone uses yacc/lex to parse code, but I do not.

My compiler (which already works) is hand-coded in C++/STL, and I cannot say it is complex or took too much time. It has both a lexer and a parser of sorts, but they are not auto-generated.

Earlier, I wrote a C compiler (not the full spec) the same way - it was able to compile a program in a single pass, resolving back references and doing preprocessing as it went - something that is definitely impossible with yacc/lex.
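
For illustration, roughly what I mean by resolving references in a single pass - a simplified C++ sketch, not my actual compiler code: backward references are emitted directly, while forward references are recorded and backpatched once the target is seen.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy emitter: jump targets that are already known are written immediately;
// unknown (forward) targets leave a placeholder that is patched later.
struct Emitter {
    std::vector<int32_t> code;                                    // emitted "instructions"
    std::unordered_map<std::string, size_t> labels;               // label -> code offset
    std::unordered_map<std::string, std::vector<size_t>> fixups;  // label -> unresolved uses

    void jump(const std::string& label) {
        auto it = labels.find(label);
        if (it != labels.end()) {
            code.push_back(static_cast<int32_t>(it->second));  // back reference: already known
        } else {
            fixups[label].push_back(code.size());              // forward reference: remember it
            code.push_back(-1);                                // placeholder
        }
    }

    void define(const std::string& label) {
        size_t here = code.size();
        labels[label] = here;
        for (size_t use : fixups[label])                       // patch all earlier uses
            code[use] = static_cast<int32_t>(here);
        fixups.erase(label);
    }
};

int main() {
    Emitter e;
    e.jump("end");      // forward reference, placeholder emitted at offset 0
    e.define("end");    // placeholder patched to offset 1
    return e.code[0] == 1 ? 0 : 1;
}
```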

I just cannot convince myself to scrap all this and start diving into yacc/lex, which might take quite an effort to adopt and might introduce some grammar limitations.

Is there something I am missing by not using yacc/lex? Am I doing an evil thing?

Jonathan Leffler
BarsMonster
  • It depends on how you're doing it. I'm using context-free expressions - of a form that didn't see publication until the late 2010s and early 2020s. The parser for C-BC, for instance, was manually computed algebraically with the aid of context-free expressions: https://github.com/RockBrentwood/CBC ... the same goes for the regular expression syntax used here: https://github.com/RockBrentwood/RegEx, but only for the lexical part of the syntax of the resurrected version of AutoMath here: https://github.com/RockBrentwood/AutoMath, not the parser (yet) - that grammar is still in yacc. – NinjaDarth Nov 12 '22 at 22:06

5 Answers

6

The main advantage of using any kind of lexer/parser generator is that it gives you a lot more flexibility if your language evolves. In a hand-coded lexer/parser (especially if you've mixed a lot of functionality into a single pass!), changes to the language get nasty fairly quickly, whereas with a parser generator you make the change, re-run the generator, and move on with your life. There are certainly no inherent technical limitations to just writing everything by hand, but I think the evolvability and maintainability of automating away the boring bits are worth it!

Gian
  • 2
    Good, modular design (for example, tokenizer and lexical analyser should be completely separated) can make this less of an issue. Still important. –  Aug 29 '10 at 11:23
4

Yacc is inflexible in some ways:

  • good error handling is hard (basically, its algorithm is only defined to parse a correct string correctly, otherwise, all bets are off; this is one of the reasons that GCC moved to a hand-written parser)
  • context-dependency is hard to express, whereas with a hand-written recursive-descent parser you can simply add a parameter to the parsing functions (see the sketch just below)
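
For example, something along these lines - an invented sketch with made-up names, not taken from any real compiler - where the context is passed as an ordinary argument:

```cpp
#include <set>
#include <string>

// A hand-written recursive-descent parser can thread context through plain
// function parameters; expressing the same thing in a yacc grammar is awkward.
struct Parser {
    std::set<std::string> typedefNames;  // collected while parsing declarations

    void parseExpression(bool typeNamesAllowed) {
        // ... dispatch on the lookahead token; consult typedefNames only
        // when typeNamesAllowed is true ...
        (void)typeNamesAllowed;
    }

    void parseCast()      { parseExpression(/*typeNamesAllowed=*/true);  }
    void parseCondition() { parseExpression(/*typeNamesAllowed=*/false); }
};
```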

Furthermore, I have noticed that lex/yacc object code is often bigger than a hand-written recursive descent parser (source code tends to be the other way round).

I have not used ANTLR so I cannot say if that is better at these points.

jilles
3

The other huge advantage of using generators is that they are guaranteed to process exactly and only the language you specified in the grammar. You can't say that of any hand-written code. The LR/LALR variants are also guaranteed to be O(N), which again you can't assert about any hand coding, at least not without a lot of effort in constructing the proof.

I've written both and lived with both, and I would never hand-code again. I only hand-coded that one because I didn't have yacc on the platform at the time.

user207421
  • Well, I do not need a mathematical proof to know that my code is fast - usually it's obvious that it's somewhere around O(N) or O(N log N). – BarsMonster Aug 29 '10 at 02:23
  • 2
    I didn't say anything about 'fast'. I am talking about algorithmic complexity, and you need a proof for that. – user207421 Aug 29 '10 at 02:57
  • 1
    Well, when I write code, I always know its complexity. If a tool is writing code for me, then I definitely would need a proof. – BarsMonster Aug 29 '10 at 03:26
  • 4
    You've got it back to front. LR parsing has an O(N) proof by Don Knuth, 1965, and LALR parsing ditto by Frank DeRemer, 1969. Where's yours? You can't just 'know' the algorithmic complexity of something as complex as a hand-written compiler. – user207421 Aug 29 '10 at 09:45
  • Okay, so he doesn't know (for sure) - so what? "Compiles in O(n)" is not much of a selling point. If it's "fast enough" for average projects, nobody will care. –  Aug 29 '10 at 11:23
  • 2
    If this is a marketing discussion it's news to me. What I am discussing here are the advantages of compiler generators, which is what the OP was asking about. Accepting exactly the correct language and O(N) complexity are important whether you can market them directly or not, just as sunk cost, maintainability, extensibility, and time to market are too, although you don't market them either. Otherwise what exactly were Don Knuth and Frank DeRemer doing? – user207421 Aug 30 '10 at 11:40
1

Maybe you are missing out on ANTLR, which is good for languages that can be defined with a recursive-descent parsing strategy.

There are potentially some advantages to using Yacc/Lex, but it is not mandatory to use them. There are some downsides to using Yacc/Lex too, but the advantages usually outweigh the disadvantages. In particular, it is often easier to maintain a Yacc-driven grammar than a hand-coded one, and you benefit from the automation that Yacc provides.

However, writing your own parser from scratch is not an evil thing to do. It may make it harder to maintain in future, but it may make it easier, too.

Jonathan Leffler
0

It certainly depends on the complexity of your language's grammar. An easy grammar means an easy implementation, and you can just do it yourself.

Take a look at perhaps the worst example of all: C++. :) (Does anybody know another language, apart from natural languages, that is more difficult to parse correctly?) Even with tools like ANTLR it is quite difficult to get right, though it is manageable. On the other hand, even though it is much harder that way, some of the best C++ parsers, e.g. GCC and LLVM, are mostly handwritten.
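
The classic textbook illustration (not tied to any particular parser) is that the same tokens mean different things depending on earlier declarations:

```cpp
// "T * x;" cannot be classified as a declaration or an expression from the
// tokens alone; the parser has to know whether T currently names a type.

int T = 2, x = 3;
// typedef int T;   // with this instead of the line above, the statement in
                    // main() declares x as "pointer to int"

int main() {
    T * x;   // as written: an expression (multiplication, result discarded)
    return 0;
}
```

This is one reason C and C++ front ends typically feed symbol-table information back into their parsing decisions.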

If you don't need too much flexibility and your language is not too trivial, you will certainly save some work/time by using ANTLR.

Albert
  • The complexity of my language is just a little less than C++. – BarsMonster Aug 29 '10 at 02:14
  • 1
    Classic Fortran is pretty hard to parse: the assignment "`DO10I=12.43`" and the DO loop "`DO10I=12,43`" cannot even be split up into words accurately until you get to the comma vs full stop. (The first involves a floating point variable `DO10I`, the second involves an integer variable `I` and a label `10`: `DO 10 I = 12, 43`.) This tends to make it rather hard! – Jonathan Leffler Aug 29 '10 at 09:07
  • "a little less than C++" - that's an instance of self-limitation, which pretty much answers your question. You're limiting yourself to within a certain threshold of complexity by the choice to go manual - while others - who are using the automation to crank out parsers - are not bound by this constraint. I'd be more interested in seeing you raise your game a little and try to crank out a C++ parser by hand. A primitive version (with yacc included) is: Release E of "cfront" transcribed https://www.softwarepreservation.org/projects/c_plus_plus/cfront/release_e We might GitHub it later. – NinjaDarth Nov 12 '22 at 22:40