Output language/format for toy compiler

Question

I took a compilers course in university, and it was very informative and a lot of fun, although also a lot of work. Since we were given a language specification to implement, one thing I didn't learn much about was language design. I'm now thinking of creating a simple toy language for fun, so that I can play around and experiment with different language design principles.

One thing I haven't decided as of yet is what language or format I'd like my compiler to output. Ideally, I'd like to output bytecode for a virtual machine which is easy to use and also has some facilities for debugging (e.g. being able to pause execution and look at the stack at any point.) I've not found one that struck my fancy yet, though. To give you an idea of what I'm looking for, here are some of the options I've considered, along with their pros and cons as I see them:

I could output textual x86 assembly language and then invoke an assembler like NASM or FASM. This would give me some experience compiling for actual hardware, as my previous compiler work was done on a VM. I could probably debug the generated programs using gdb, although it might not be as easy as using a VM with debugging support. The major downside to this is that I have limited experience with x86 assembly, and as a CISC instruction set it's a bit daunting.
I could output bytecode for a popular virtual machine like the JVM or Lua virtual machine. The pros and cons of these are likely to vary according to which specific VM I choose, but in general the downside I see here is potentially having to learn a bytecode which might have limited applicability to my future projects. I'm also not sure which VM would be best suited to my needs.
I could use the same VM used in my compilers course, which was designed at my university specifically for this purpose. I am already familiar with its design and instruction set, and it has decent debugging features, so that's a huge plus. However, it is extremely limited in its capabilities and I feel like I would quickly run up against those limits if I tried to do anything even moderately advanced.
I could use LLVM and output LLVM Intermediate Representation. LLVM IR seems very powerful and being familiar with it could definitely be of use to me in the future. On the other hand, I really have no idea how easy it is to work with and debug, so I'd greatly appreciate advice from someone experienced in that area.
I could design and implement my own virtual machine. This has a huge and obvious downside: I'd essentially be turning my project into two projects, significantly decreasing the likelihood that I'd actually get anything done. However, it's still somewhat appealing in that it would allow me to make a VM which had "first-class" support for the language features I want—for instance, the Lua VM has first-class support for tables, which makes it easy to work with them in Lua bytecode.

So, to summarize, I'm looking for a VM or assembler I can target which is relatively easy to learn and work with, and easy to debug. As this is a hobby project, ideally I'd also like to minimize the chance that I spend a great deal of time learning some tool or language that I'll never use again. The main thing I hope to gain from this exercise is some first-hand understanding of the complexities of language design, though, so anything that facilitates a relatively quick implementation will be great.

A good compiler design would allow you to convert your IR into anything just by changing your code generation. I would simply start by converting your language to some IR and then create a backend that converts it to C/C++. That way you can learn about the different components of the compiler and not have to worry about verifying lower-level language constructs in asm or bytecode. — linuxuser27, May 06 '12 at 20:10

score 6 · Accepted Answer · answered May 06 '12 at 20:26

It really depends on how complete a language you want to build, and what you want to do with it. If you want to create a full-blown language for real projects that interacts with other languages, your needs are going to be much greater than if you just want to experiment with the complexities of compiling particular language features.

Output to an assembly language file is a popular choice. You can annotate the assembly language file with the actual code from your program (in comments). That way, you can see exactly what your compiler did for each language construct. It might be possible (it's been a long time since I worked with these tools) to annotate the ASM file in a way that makes source-level debugging possible.
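As a minimal sketch of that annotation idea: the fragment below is a hypothetical toy code generator, not taken from any real compiler. It assumes NASM-style syntax (where `;` starts a comment), handles only statements of the form `name = int + int`, and assumes the target names are declared elsewhere as memory locations.

```python
# Hypothetical toy code generator: emits NASM-style assembly for statements
# of the form `name = int + int`, echoing each source line as a comment.
def emit_annotated(source_lines):
    out = []
    for lineno, src in enumerate(source_lines, start=1):
        # The annotation: the original source text, as an assembly comment.
        out.append(f"    ; line {lineno}: {src.strip()}")
        target, expr = (part.strip() for part in src.split("="))
        left, right = (term.strip() for term in expr.split("+"))
        out.append(f"    mov eax, {left}")
        out.append(f"    add eax, {right}")
        # Assumes `target` is a label for a 32-bit memory location.
        out.append(f"    mov [{target}], eax")
    return "\n".join(out)

asm = emit_annotated(["x = 1 + 2", "y = 3 + 4"])
print(asm)
```

Reading the output, each group of three instructions sits directly under the source line that produced it, which makes it easy to see what the compiler did for each construct.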

If you're going to be working in language design, then you'll almost certainly need to know x86 assembly language. So the time you spend learning it won't be wasted. And the CISC instruction set really isn't a problem. It'll take you a few hours of study to understand the registers and the different addressing modes, and probably less than a week to be somewhat proficient, provided you've already worked with some other assembly language (which it appears you have).

Outputting bytecode for the JVM, Lua, or .NET is another reasonable approach, although if you do that you tie yourself to the assumptions made by the VM. And, as you say, it's going to require detailed knowledge of the VM. It's likely that any of the popular VMs would have the features you need, so selection is really a matter of preference rather than capabilities.

LLVM is a good choice. It's powerful and becoming increasingly popular. If you output LLVM IR, you're much more likely to be able to interact with others' code, and have theirs interact with yours. Knowing the workings of LLVM is a definite plus if you're looking to get a job in the field of compilers or language design.
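To give a rough feel for what targeting LLVM IR can look like, here is a small sketch that emits textual IR for an integer expression. The tuple-based expression format and the function names are made up for illustration; only the emitted instructions (`add`, `mul`, `ret` on `i32`) are real LLVM IR. It uses named temporaries (`%t1`, `%t2`, ...) so it does not depend on LLVM's numbering rules for unnamed values.

```python
import itertools

def compile_expr(node, lines, counter):
    """Return an IR operand for `node`, appending instructions to `lines`.

    `node` is either an int or a tuple (op, left, right), where op is an
    LLVM integer instruction name such as "add" or "mul" (hypothetical format).
    """
    if isinstance(node, int):
        return str(node)
    op, left, right = node
    lhs = compile_expr(left, lines, counter)
    rhs = compile_expr(right, lines, counter)
    tmp = f"%t{next(counter)}"
    lines.append(f"  {tmp} = {op} i32 {lhs}, {rhs}")
    return tmp

def emit_main(expr):
    """Wrap the compiled expression in a `main` that returns its value."""
    body = []
    result = compile_expr(expr, body, itertools.count(1))
    return "\n".join(
        ["define i32 @main() {", "entry:", *body, f"  ret i32 {result}", "}"]
    )

ir = emit_main(("add", 2, ("mul", 4, 10)))
print(ir)
```

The printed text can be saved to a `.ll` file and handed to the LLVM tools; the point of the sketch is only that the compiler's back end can be little more than string formatting at first.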

I would not recommend designing and implementing your own virtual machine before you get a little more experience with other VMs, so that you can see and understand the tradeoffs they made in implementation. If you go down this path, you'll end up studying the JVM, Lua, .NET, and many other VMs. I'm not saying not to do it, but rather that doing so will take you away from your stated purpose of exploring language design.

Knowledge is rarely useless. Whatever you decide to use will require you to learn new things. And that's all to the good. But if you want to concentrate on language design, select the output format that requires the least amount of work that's not specifically language design. Consistent, of course, with capabilities.

Of your options, it looks to me like your university's VM is out. I would say that designing your own VM is out, too. Of the other three, I'd probably go with LLVM. But then, I'm very familiar with x86 assembly so the idea of learning LLVM is somewhat appealing.

Thanks for the very thorough response! I'm very much leaning towards LLVM IR now. I may look at writing my own x86 backend separately later. I don't plan to do professional work in language design or compilers, but x86 assembly is valuable knowledge for any programmer, I figure. — Mitch Lindgren, May 07 '12 at 01:26

score 5 · Answer 2 · answered May 06 '12 at 20:45

Have a look at my Programming Languages ZOO. It has a number of toy implementations of languages, including some virtual machines and made-up assembly (a stack machine). It should help you get started.
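For a sense of scale, a made-up stack machine of this general kind fits in a few lines. The sketch below is not taken from the PL Zoo; the instruction names (`push`, `add`, `mul`) and the `trace` option are invented for illustration. The trace shows the sort of "look at the stack at any point" debugging the question asks about.

```python
def run(program, trace=False):
    """Execute a list of (opcode, *args) tuples on a simple operand stack."""
    stack = []
    for pc, (op, *args) in enumerate(program):
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown instruction: {op}")
        if trace:
            # Print the stack after every instruction, debugger-style.
            print(f"{pc}: {op:<5} stack={stack}")
    return stack

# Computes (2 + 3) * 4.
result = run([("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)], trace=True)
```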

Andrej Bauer

Wouldn't that just completely ruin the fun factor? You can't get the thrill of Man over Machine by copying somebody else's compiler. – Hans Passant May 06 '12 at 20:50

I never said he should copy it, but what is the point of reinventing the wheel? These implementations are very short, on the order of 500 lines, including a lot of comments. They are not real programming languages. – Andrej Bauer May 06 '12 at 21:46
I think these are great! Being completely inexperienced at language design except insofar as having a simple grasp of what I think is a reasonably wide variety of paradigms and principles, I am certainly not expecting to invent anything completely novel on my first try. I'm more interested in playing around with different ideas, seeing which ones mesh together well, and seeing what it takes to implement them. To that end, I think these examples could be very helpful. Thanks, Andrej! – Mitch Lindgren May 06 '12 at 22:37

@Hans Passant: What precisely is your problem here? If the OP thinks it is fun to reinvent 50 years of experience with programming languages, he need not look at what other people did, and neither does he have to use the Internet. The PL Zoo is a useful *resource* because it demonstrates basic points about programming language design: how to implement closures, how to translate a functional language to bytecode, an imperative language to a stack machine, how to evaluate lazily, how to implement type inference, how to implement record subtyping, how to implement Prolog-style search, etc. – Andrej Bauer May 06 '12 at 22:37
Absolutely awesome resource! Brilliant! Also how beautiful is ML, honestly. Love it. – Dimitar Dimitrov Sep 03 '13 at 03:00

score 1 · Answer 3 · answered Jun 19 '12 at 06:42

If you're just playing around with language design, what about an interpreted language? Having the whole AST still around at run time lets you do some very cool things.
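A minimal sketch of the idea, assuming a hypothetical AST made of nested tuples (the node kinds `num`, `var`, `add`, and `mul` are invented for this example): the interpreter walks the tree directly, and because the tree is still available at run time, the same structure can be printed back as source next to its value.

```python
def evaluate(node, env):
    """Evaluate an AST node against a dict of variable bindings."""
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return env[node[1]]
    if kind in ("add", "mul"):
        left, right = evaluate(node[1], env), evaluate(node[2], env)
        return left + right if kind == "add" else left * right
    raise ValueError(f"unknown node kind: {kind}")

def to_source(node):
    """Render the AST back to source text, possible because it is kept around."""
    kind = node[0]
    if kind in ("num", "var"):
        return str(node[1])
    symbol = "+" if kind == "add" else "*"
    return f"({to_source(node[1])} {symbol} {to_source(node[2])})"

tree = ("add", ("var", "x"), ("mul", ("num", 2), ("num", 5)))
value = evaluate(tree, {"x": 1})
print(f"{to_source(tree)} = {value}")
```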

John F. Miller

