
In the LLVM tutorials there are instructions on how to write a simple JIT compiler. Unfortunately, the lexer and parser in that tutorial are written by hand. I was thinking that such a solution is fine for learning purposes, but it is not suitable for writing complex, production-ready compilers. It seems that GCC and a few other "big compilers" are hand-written, though. Still, I think all these parser generators give a big boost when writing your own compiler (especially when you're doing it alone, without a team of people).

Is it possible to use any existing parser generator, like Bison / ANTLR / Packrat / Elkhound etc., together with LLVM to create a JIT compiler? I want to be able to "feed" the parser constantly (not just once at the beginning) with expressions and compile them at runtime.

Additionally, I've found a lot of questions about the "best, modern" parser generator (like this one: https://stackoverflow.com/questions/428892/what-parser-generator-do-you-recommend). If it is possible to use these tools to create an LLVM JIT compiler, I would be thankful for any additional hints and recommendations as to which tool would be best in terms of performance and flexibility in this particular case.

Wojciech Danilo
  • "Such solution is good for learning purposes but it is not suitable for writing complex, production ready compilers" - Hm. I always thought GCC was a complex and production-ready compiler. Whatever... –  Feb 06 '13 at 18:30
  • GCC used Bison in the beginning, but you're right – I'm fixing that in my question. But really, I would love to use a generator to simplify this task if it is possible. – Wojciech Danilo Feb 06 '13 at 18:34
  • If anything, I would say rather the opposite is true: yacc, Bison, et al., are suitable for learning purposes and such, but for serious production work, a hand-written parser may be the only way to meet requirements. – Jerry Coffin Feb 06 '13 at 18:35
  • What this guy said. `^^` - When I wanted to write my first parser, I knew little about it, so I learned flex and yacc. After trying to get my parser working well, I just decided to write it by hand. –  Feb 06 '13 at 18:37
  • @JerryCoffin: What requirements are you talking about? JIT? Is using such a parser generator really a bad thing when writing a custom language compiler? Can't we get what we really need with these tools? Don't they give us some "boost" while writing compilers? (I'm not a total noob; I have written compilers with Bison and PLY, but never one as big as I am planning right now) – Wojciech Danilo Feb 06 '13 at 18:39
  • @danilo2: I'm not saying Bison (etc.) can't be used for anything big or complex. I'm saying the progression isn't "hand-written parser if possible, parser generator if the job is too big or complex", but the opposite of that – parser generator if possible, hand-written if that can't meet your requirements. – Jerry Coffin Feb 06 '13 at 18:54
  • As to what requirements those would be: could be class of grammar, incremental parsing, contextual tokenizing, speed, memory usage, or almost any number of others. – Jerry Coffin Feb 06 '13 at 18:56
  • @JerryCoffin: Thank you. So what would you suggest for writing a JIT compiler for a language with the complexity of, for example, Python? Should I start with a hand-written parser or use a parser generator? – Wojciech Danilo Feb 06 '13 at 19:37
  • I wouldn't suggest writing a JIT for a language as complex as Python. Historically, JITs for Python end up *extremely* restricted in what they support (efficiently or sometimes, at all), immature, unreliable, plain wrong, and/or slow. PyPy is different because they didn't write a JIT, they automatically generate it from an interpreter. Perhaps you're underestimating the complexity of Python. Or the complexity of JIT compilers. Or you're overestimating the complexity of your language. –  Feb 06 '13 at 19:53
  • @danilo2: Are you talking about compiling from Python source to byte code, or byte code to machine code, or Python source directly to machine code? Do you just want JIT, or do you want incremental parsing (for a REPL, for example)? Ultimately, I don't think it matters a lot though. Python has fairly simple (LL(1)) syntax, and complex semantics, so if your language is similar, the parser is probably the least of your problems. – Jerry Coffin Feb 06 '13 at 19:56
  • Ah, I'm sorry for this – I'm not talking about Python; I'm talking about a language whose syntax is similar to Python but which is statically typed. Maybe a better example would be Java – sorry for the confusion. – Wojciech Danilo Feb 06 '13 at 20:08
  • And because of the "fairly simple" syntax of my language, I wanted to use a parser generator. So in such a case, should I stick with a hand-written parser, or is it possible to use a generator together with the LLVM JIT? – Wojciech Danilo Feb 06 '13 at 20:10
  • It is possible to build really sophisticated parsers with parser generators (we use GLR) with very high productivity and maintainability (including the famously hard-to-parse C++; in our case, full C++11 for ANSI/MS/GCC). I suspect one can produce pretty good error messages with such parser generators by extending them explicitly with error-handling productions (see the recent paper by Visser in TOPLAS). What you can't do with a hand-written parser is build grammar analysis tools, incremental editors, syntax-directed pattern matchers and transform tools, etc. That seems like a huge loss. – Ira Baxter Feb 07 '13 at 01:20

1 Answer


There are a lot of advantages to using a parser generator like Bison or ANTLR, particularly while you're developing a language. You'll undoubtedly end up making changes to the grammar as you go, and you'll want to end up with documentation of the final grammar. Tools which produce the parser automatically from that documentation (the grammar file itself) are really useful. They can also help give you confidence that the grammar of the language is (a) what you think it is and (b) not ambiguous.

If your language (unlike C++) is actually LALR(1), or even better, LL(1), and you're using LLVM tools to build the AST and IR, then it's unlikely that you will need to do much more than write down the grammar and provide a few simple actions to build the AST. That will keep you going for a while.
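As a rough illustration of "write down the grammar and provide a few simple actions to build the AST" (my own sketch, not tied to any particular generator: the semantic actions a tool like Bison or ANTLR would run are emulated here with a tiny hand-rolled LL(1) recursive-descent parser in Python, and the tuple-shaped AST nodes are hypothetical):

```python
# Sketch: an LL(1) expression grammar with actions that build an AST.
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> '(' expr ')' | NUMBER
import re

def tokenize(src):
    # Hypothetical token set: integers and + - * / ( )
    return re.findall(r"\d+|[+\-*/()]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        t = self.peek()
        if tok is not None and t != tok:
            raise SyntaxError(f"expected {tok!r}, found {t!r}")
        self.pos += 1
        return t

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.eat()
            node = (op, node, self.term())   # "action": build an AST node
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.eat()
            node = (op, node, self.factor())
        return node

    def factor(self):
        if self.peek() == "(":
            self.eat("(")
            node = self.expr()
            self.eat(")")
            return node
        return ("num", int(self.eat()))

ast = Parser(tokenize("1 + 2 * 3")).expr()
# ast == ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))
```

In a generator-based setup the class above disappears: you write only the grammar rules and the one-line actions, and walking the resulting AST to emit LLVM IR is a separate pass.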

The usual reason that people eventually choose to build their own parsers, other than the "real programmers don't use parser generators" prejudice, is that it's not easy to provide good diagnostics for syntax errors, particularly with LR(1) parsing. If that's one of your goals, you should try to make your grammar LL(k) parseable (it's still not easy to provide good diagnostics with LL(k), but it seems to be a little easier) and use an LL(k) framework like Antlr.
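To illustrate the diagnostics point (my own sketch, not from the answer): in a predictive LL(1) parser, the set of tokens acceptable at the failure point falls straight out of the grammar tables, so the error message can simply name them. The toy grammar and its FIRST set below are hypothetical.

```python
# Sketch: LL(1) parsers know exactly which tokens they can accept next,
# which makes "expected X, found Y" diagnostics cheap to produce.

# Hypothetical FIRST set for a toy statement grammar:
FIRST = {
    "stmt": {"if", "while", "return", "ident"},
}

def parse_stmt(tokens, pos=0):
    tok = tokens[pos] if pos < len(tokens) else "<eof>"
    if tok not in FIRST["stmt"]:
        # The expected set comes directly from the grammar, not from
        # reverse-engineering an LR automaton state.
        expected = ", ".join(sorted(FIRST["stmt"]))
        raise SyntaxError(f"at token {pos}: expected one of {expected}, found {tok!r}")
    return tok

try:
    parse_stmt(["42"])
except SyntaxError as e:
    msg = str(e)
```

With LR(1), by contrast, the parser only discovers the problem in some automaton state, and mapping that state back to a human-readable expectation takes extra work.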

There is another strategy, which is to first parse the program text in the simplest possible way using an LALR(1) parser, which is more flexible than LL(1), without even trying to provide diagnostics. If the parse fails, you can then parse it again using a slower, possibly even backtracking parser, which doesn't know how to generate ASTs, but does keep track of source location and try to recover from syntax errors. Recovering from syntax errors without invalidating the AST is even more difficult than just continuing to parse, so there's a lot to be said for not trying. Also, keeping track of source location is really slow, and it's not very useful if you don't have to produce diagnostics (unless you need it for adding debugging annotations), so you can speed the parse up quite a bit by not bothering with location tracking.
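The two-pass idea can be sketched roughly like this (a toy illustration where balanced parentheses stand in for a real grammar; `fast_parse` and `slow_parse_for_diagnostics` are hypothetical stand-ins for the generated LALR(1) parser and the slower location-tracking parser):

```python
# Sketch: fast parse with no location tracking; only on failure re-parse
# slowly with line/column bookkeeping to produce a diagnostic.

def fast_parse(src):
    # Fast path: no location tracking, no recovery; just succeed or fail.
    depth = 0
    for ch in src:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise SyntaxError("parse failed")
    if depth != 0:
        raise SyntaxError("parse failed")
    return "AST"   # placeholder for the real AST

def slow_parse_for_diagnostics(src):
    # Slow path: tracks line/column so the error can be located precisely.
    line, col, depth = 1, 1, 0
    for ch in src:
        if ch == "\n":
            line, col = line + 1, 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return f"{line}:{col}: unmatched ')'"
        col += 1
    if depth != 0:
        return f"{line}:{col}: unclosed '('"
    return None

def compile_source(src):
    try:
        return fast_parse(src)                   # common case: no overhead
    except SyntaxError:
        return slow_parse_for_diagnostics(src)   # rare case: pay for locations
```

Correct programs never pay for the bookkeeping; only programs that fail the fast parse take the slow, diagnostic-producing path.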

Personally, I'm biased against packrat parsing, because it's not clear what the actual language parsed by a PEG is. Other people don't mind this so much, and YMMV.
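A tiny example of why the accepted language can be non-obvious (my illustration, not from the answer): PEG's ordered choice commits to the first alternative that matches, so in a rule like `A <- 'a' / 'ab'` the second alternative can never match a complete input, which you won't see if you read the grammar as if it were a CFG.

```python
# Sketch: PEG ordered choice vs. CFG alternation.
# Rule: A <- 'a' / 'ab'

def peg_A(s):
    # Choice 1: literal 'a'.
    if s.startswith("a"):
        return 1          # characters consumed; choice 2 is never tried
    # Choice 2: literal 'ab' (unreachable for any input starting with 'a').
    if s.startswith("ab"):
        return 2
    return None

def matches_whole(s):
    n = peg_A(s)
    return n is not None and n == len(s)

# A CFG with alternatives  a | ab  accepts both "a" and "ab";
# this PEG accepts only "a" as a complete input.
```

With a context-free grammar you could run an ambiguity check or enumerate derivations; with the PEG, the only way to find out that `"ab"` is rejected is to execute the rule.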

rici
  • Why is it "not clear" what the actual language is? PEG is well-defined, even with all the cool hacks that packrat allows you to do (higher-order parsing and such). – SK-logic Feb 07 '13 at 08:50
  • @SK-logic: well-defined is not the same as clear. A hand-crafted parser written in C++ is well-defined. A Turing machine is well-defined. Yes, PEG is well-defined. But for all of them, the only way to see if a given string is in the language is to execute the code. (Of those three alternatives, PEG is the least bad, imo. But I still prefer formal context-free grammars. However, as I said, other people like PEG, and whatever works for you is cool with me.) – rici Feb 07 '13 at 17:32
  • From my practical experience, PEGs are the clearest and easiest grammars to read. I can translate a language spec straight into a PEG with very few modifications. It is possible to obfuscate one, of course, but I have not seen a really bad PEG grammar yet, whereas there are many Yacc grammars that are unreadable beyond any hope. – SK-logic Feb 07 '13 at 18:40