47

All over the web, I am getting the feeling that writing a C backend for a compiler is not such a good idea anymore. GHC's C backend is not being actively developed anymore (this is my unsupported feeling). Compilers are targeting C-- or LLVM.

Normally, I would think that GCC is a good old mature compiler that performs well at optimizing code, so compiling to C would leverage GCC's maturity to yield better and faster code. Is this not true?

I realize that the question greatly depends on the nature of the language being compiled and on other factors such as getting more maintainable code. I am looking for a rather more general answer (w.r.t. the compiled language) that focuses solely on performance (disregarding code quality, etc.). I would also be really glad if the answer includes an explanation of why GHC is drifting away from C and why LLVM performs better as a backend (see this) or any other examples of compilers doing the same that I am not aware of.

muhmuhten
  • 3,313
  • 1
  • 20
  • 26
aelguindy
  • 3,703
  • 24
  • 31

11 Answers

28

While I'm not a compiler expert, I believe that it boils down to the fact that you lose something in translation to C as opposed to translating to e.g. LLVM's intermediate language.

If you think about the process of compiling to C, you create a compiler that translates to C code, then the C compiler translates to an intermediate representation (the in-memory AST), then translates that to machine code. The creators of the C compiler have probably spent a lot of time optimizing certain human-made patterns in the language, but you're not likely to be able to create a fancy enough compiler from a source language to C to emulate the way humans write code. There is a loss of fidelity going to C - the C compiler doesn't have any knowledge about your original code's structure. To get those optimizations, you're essentially back-fitting your compiler to try to generate C code that the C compiler knows how to optimize when it's building its AST. Messy.

If, however, you translate directly to LLVM's intermediate language, that's like compiling your code to a machine-independent high-level bytecode, which is akin to the C compiler giving you access to specify exactly what its AST should contain. Essentially, you cut out the middleman that parses the C code and go directly to the high-level representation, which preserves more of the characteristics of your code by requiring less translation.

Also related to performance, LLVM can do some really tricky stuff for dynamic languages like generating binary code at runtime. This is the "cool" part of just-in-time compilation: it is writing binary code to be executed at runtime, instead of being stuck with what was created at compile time.

Matt
  • 10,434
  • 1
  • 36
  • 45
  • JIT compilation is a big plus of course. I agree with what you are saying to an extent. I just can't see what one "loses" when compiling to C but does not lose when compiling to LLVM; can you please elaborate on that a little bit? I mean I agree that one "loses" some structure with language transformation, but does the same not happen with LLVM or any other backend language? – aelguindy Jan 26 '12 at 10:19
  • 2
    OK, imagine in C the statement `x++` - this could be compiled to copy x to another register, then increment the value of x, then return the copied (previous) value of x. A very obvious optimization is to compile this using a [test-and-set](http://en.wikipedia.org/wiki/Test-and-set) instruction, if the processor supports it, which does exactly this, but faster and atomically. If you represent the same statement in C as `x = x + 1`, it may not be optimized, because it's not exactly the same - you never need to return the previous value, right? – Matt Jan 29 '12 at 23:26
  • 3
    So to get this optimization, you would have to build _your_ compiler - the one generating C - to know the difference between the two and produce different C code depending on the situation. If you compile to LLVM bytecode, LLVM can infer this from your generated bytecode by e.g. checking whether you look at the return value and deciding to optimize then. GCC _may_ be smart enough to do this, since this is such a trivial example, but it's just easier for an optimizer to find this sort of low-hanging fruit when dealing with a lower-level language, like LLVM bytecode, than when dealing with C. – Matt Jan 29 '12 at 23:30
  • 1
    *"The creators of the C compiler have probably spent a lot of time optimizing certain human-made patterns in the language"* Do you have any evidence of this, and everything that follows it, at all? When you're compiling to C you're going to simplify the patterns, yes, but simple patterns are often the most efficient ones. – corazza Dec 12 '14 at 17:56
  • @jco I encourage you to look at [GCC's optimizations](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html). That page is 1431 lines documenting the GCC team's performance enhancements. A good many of these are language-level optimizations. Picking one at random: having the compiler unroll some loops (-funroll-loops) as opposed to unrolling all loops (-funroll-all-loops), when the second is known to be less performant. In order to use GCC's optimizations, you have to target the precursor C code to trigger that unrolling, and study the binary to validate. Better just to target LLVM, IMO. – Matt Dec 13 '14 at 21:49
  • Note that there is a terrible difference between LLVM IR in principle (a universal backend for languages) and LLVM IR in practice (anything not stressed by the clang frontend is unimplemented, undocumented and/or buggy) – Marco van de Voort Sep 06 '18 at 18:34
  • This answer makes no sense: if targeting C loses some magic idiomatic information, targeting IR should lose _even more_, since it's what C converts to. If LLVM can figure out the `x++` pattern from a `x = x + 1` then no extra info needs to come down from the C layer. Besides, the first thing that LLVM does with the IR is to canonicalize it, so it first tries to _remove_ extra information. –  Feb 15 '19 at 16:06
28

Let me list my two biggest problems with compiling to C. Whether these are problems for your language depends on what kind of features you have.

  • Garbage collection: When you have garbage collection you may have to interrupt regular execution at just about any point in the program, and at that point you need access to all pointers that point into the heap. If you compile to C you have no idea where those pointers might be. C is responsible for local variables, arguments, etc. The pointers are probably on the stack (or maybe in other register windows on a SPARC), but there is no real access to the stack. And even if you scan the stack, which values are pointers? LLVM actually addresses this problem (though I don't know how well, since I've never used LLVM with GC).

  • Tail calls: Many languages assume that tail calls work (i.e., that they don't grow the stack); Scheme mandates it, Haskell assumes it. This is not the case with C. Under certain circumstances you can convince some C compilers to do tail calls. But you want tail calls to be reliable, e.g., when tail calling an unknown function. There are clumsy workarounds, like trampolining, but nothing quite satisfactory.

augustss
  • 22,884
  • 5
  • 56
  • 93
  • Many languages = functional languages ? :-) – Marco van de Voort Jan 26 '12 at 14:00
  • Yes, mostly functional, I believe. – augustss Jan 26 '12 at 16:34
  • 10
    Lennart is spot on here - these are two of the main issues we have had with GHC in compiling via C. Dealing with garbage collection properly really entails managing the stack yourself - the alternative is to use conservative GC, which is not really viable in a production system. LLVM solves the tail-call problem, but its solution to the GC problem isn't good enough yet for GHC (and it's not clear it ever will be - there was a serious attempt in C-- to do this right, and even there it entailed some compromises). – Simon Marlow Feb 07 '12 at 11:51
  • But wait, tail call optimization were even in [the ancient gcc-3.0.4](https://gcc.gnu.org/onlinedocs/gcc-3.0.4/gcc/Optimize-Options.html) *(search for -foptimize-sibling-calls)*. – Hi-Angel Sep 24 '15 at 07:40
  • 1
    As far as I know gcc cannot optimize all kinds of tail calls. Tail calls can be to known functions, but can also be indirect, i.e., via a function pointer. – augustss Sep 24 '15 at 22:49
  • Last time I checked, tail calls were optimized by GCC – circl May 16 '23 at 09:40
8

Part of the reason for GHC's moving away from the old C backend was that the code produced by GHC was not code gcc could optimise particularly well. So with GHC's native code generator getting better, there was less return for a lot of work. As of 6.12, the NCG's code was slower than the C-compiled code in only very few cases, so with the NCG getting even better in ghc-7, there was not sufficient incentive to keep the gcc backend alive. LLVM is a better target because it's more modular, and one can do many optimisations on its intermediate representation before passing the result on to code generation.

On the other hand, last I looked, JHC still produced C and the final binary from that, typically (exclusively?) by gcc. And JHC's binaries tend to be quite fast.

So if you can produce code the C compiler handles well, that is still a good option, but it's probably not worth jumping through too many hoops to produce good C if you can more easily produce good executables via another route.

Daniel Fischer
  • 181,706
  • 17
  • 308
  • 431
  • I *believe* the GCC backend is supposed to stick around for bootstrapping (that's just a vague memory though). For LLVM, a major reason is that it's modular *and there already are lots of great modules*. Specifically, using LLVM gives you (paraphrasing Don Steward) 25 years of imperative language optimizations for free, and more target architectures than most compiler teams can ever hope to implement (let alone maintain). –  Jan 23 '12 at 18:47
  • Yes, for bootstrapping it's around to stay. It's not as important to produce code suitable for gcc's optimiser then, and that's maintainable with little enough work. But `-fvia-C` is already gone in 7.2. Second point: Stewart, with 't'. I don't know if he said it, but it's very true. – Daniel Fischer Jan 23 '12 at 18:52
  • Source: http://donsbot.wordpress.com/2010/03/01/evolving-faster-haskell-programs-now-with-llvm/. About the t: Yeah, sorry - sadly I can't edit any more. –  Jan 23 '12 at 18:56
  • Generating C code and letting the underlying system's C compiler optimize for you gives you >25 years of imperative language optimizations for free. :-) – R.. GitHub STOP HELPING ICE Jan 23 '12 at 19:00
  • 3
    @R.. Not to the same extent, as others have argued. C compilers are good at optimizing C code, not at optimizing arbitary low-level code; and C is far worse than (LLVM) assembly at encoding low-level details (and a compiler like GHC does optimize at this level). For instance, try pinning registers and selectively replacing function pointers with the code they point to - for C, that requires an obscure GCC extension and 2000 lines of Perl code. For LLVM, it requires defining a calling convention, some care when generating code, and 180 lines of Haskell. –  Jan 23 '12 at 19:18
  • Sounds like that doesn't fall under "imperative language optimizations". I agree there might be benefits to functional languages in not going through an imperative language for the translation, but I still question the magnitude of the benefits especially relative to the costs. – R.. GitHub STOP HELPING ICE Jan 23 '12 at 19:27
  • 3
    Well, the LLVM backend gives the existing register allocators lots of freedom to utilize even "pinned" registers when sensible (at least in between function calls), while the C Backend works by making the pinned register completely unavailable to GCC's register allocator. And what cost? The LLVM backend was created in a few man-months (well, I'm guessing here, but it was one guy's thesis), is tiny in code size, handily keeps up with (or even beats) the existing backends and provides more flexibility. –  Jan 23 '12 at 19:33
  • 2
    That's the big thing for me: writing the LLVM backend was, very clearly, astronomically simpler than writing and maintaining the GCC backend. – Louis Wasserman Jan 24 '12 at 04:07
  • I claim it's not possible to produce idiomatic C code and still support accurate GC and tail calls. – Simon Marlow Feb 07 '12 at 11:58
  • The LLVM backend is comparable in complexity to the *unregisterised* C backend - that is, the C backend without tail-calls and support for pinning certain VM registers to machine registers. Supporting these in the C backend is what added most of the complexity, and we get these in LLVM without much fuss. – Simon Marlow Feb 07 '12 at 12:00
8

One point that hasn't been brought up yet is, how close is your language to C? If you're compiling a fairly low-level imperative language, C's semantics may map very closely to the language you're implementing. If that's the case, it's probably a win, because the code written in your language is likely to resemble the kind of code someone would write in C by hand. That was definitely not the case with Haskell's C backend, which is one reason why the C backend optimized so poorly.

Another point against using a C backend is that C's semantics are actually not as simple as they look. If your language diverges significantly from C, using a C backend means you're going to have to keep track of all those infuriating complexities, and possibly of differences between C compilers as well. It may be easier to use LLVM, with its simpler semantics, or to devise your own backend, than to keep track of all that.

Daniel Lyons
  • 22,421
  • 2
  • 50
  • 77
  • 1
    I agree with your first sentence but not your second para. Even if C semantics are complex from a standpoint of writing an implementation or interfacing with arbitrary C code, things are much easier when you're just using a limited subset of the language. The only way I can think someone naive might screw it up is by violating aliasing rules or being unaware that certain arithmetic ops invoke UB. – R.. GitHub STOP HELPING ICE Jan 23 '12 at 19:34
  • 7
    I think the first step on the road to perdition is believing that C's semantics are simple. :) – Daniel Lyons Jan 23 '12 at 20:38
  • Only in very first order. If you look closer, it already stumbles on something simple as pascal with its nested procedures and the ability to pass them (as procvar) that keep access to the parent's stack frame. – Marco van de Voort Nov 05 '12 at 13:35
8

Aside from all the code-generator quality reasons, there are also other problems:

  1. The free C compilers (gcc, clang) are a bit Unix-centric.
  2. Supporting more than one compiler (e.g. gcc on Unix and MSVC on Windows) requires duplication of effort.
  3. Compilers might drag in runtime libraries (or even *nix emulations) on Windows that are painful. Having two different C runtimes to build on (e.g. Linux libc and msvcrt) complicates your own runtime and its maintenance.
  4. You get a big externally versioned blob in your project, which means a major version transition (e.g. a change of mangling that hurts your runtime lib, or ABI changes like a change of alignment) might require quite some work. Note that this goes for the compiler AND externally versioned (parts of the) runtime library. And multiple compilers multiply this. It is not as bad for C as a backend, though, as it is in the case where you directly connect to (read: bet on) a backend, like being a gcc/llvm frontend.
  5. In many languages that follow this path, you see C-isms trickle through into the main language. Of course this won't happen to you, but you will be tempted :-)
  6. Language functionality that doesn't directly map to standard C (like nested procedures, and other things that need stack fiddling) is difficult.
  7. If something is wrong, users will be confronted with C-level compiler or linker errors that are outside their field of experience. Parsing them and making them your own is painful, especially with multiple compilers and versions.

Note that point 4 also means that you will have to invest time to just keep things working when the external projects evolve. That is time that generally doesn't really go into your project, and since the project is more dynamic, multiplatform releases will need a lot of extra release engineering to cater for change.

So in short, from what I've seen, while such a move allows a swift start (you get a reasonable code generator for free for many architectures), there are downsides. Most of them are related to loss of control, and to the poor Windows support of *nix-centric projects like gcc. (LLVM is too new to say much about long term, but their rhetoric sounds a lot like gcc's did ten years ago.) If a project you are hugely dependent on keeps a certain course (like GCC moving to win64 awfully slowly), then you are stuck with it.

First, decide if you want to have serious non-*nix support (OS X being more unixy), or only a Linux compiler with a mingw stopgap for Windows. A lot of compilers need first-rate Windows support.

Second, how finished must the product become? What's the primary audience? Is it a tool for the open source developer who can handle a DIY toolchain, or do you want to target a beginner market (like many 3rd party products, e.g. RealBasic)?

Or do you really want to provide a well rounded product for professionals with deep integration and complete toolchains?

All three are valid directions for a compiler project. Ask yourself what your primary direction is, and don't assume that more options will be available in time. E.g. evaluate where projects are now that chose to be a GCC frontend in the early nineties.

Essentially, the Unix way is to go wide (maximize platforms).

The complete suites (like VS and Delphi, the latter of which recently also started to support OS X and has supported Linux in the past) go deep and try to maximize productivity, supporting especially the Windows platform nearly completely, with deep levels of integration.

The 3rd party projects are less clear cut. They go more after self-employed programmers, and niche shops. They have less developer resources, but manage and focus them better.

Marco van de Voort
  • 25,628
  • 5
  • 56
  • 89
7

As you mentioned, whether C is a good target language depends very much on your source language. So here's a few reasons where C has disadvantages compared to LLVM or a custom target language:

  • Garbage Collection: A language that wants to support efficient garbage collection needs extra information that C does not readily provide. If an allocation fails, the GC needs to find which values on the stack and in registers are pointers and which aren't. Since the register allocator is not under our control, we need to use more expensive techniques such as writing all pointers to a separate stack. This is just one of many issues when trying to support modern GC on top of C. (Note that LLVM also still has some issues in that area, but I hear it's being worked on.)

  • Feature mapping & Language-specific optimisations: Some languages rely on certain optimisations, e.g., Scheme relies on tail-call optimisation. Modern C compilers can do this, but are not guaranteed to, which could cause problems if a program relies on it for correctness. Another feature that could be difficult to support on top of C is co-routines.

    Most dynamically typed languages also cannot be optimised well by C compilers. For example, Cython compiles Python to C, but the generated C uses calls to many generic functions which are unlikely to be optimised well even by the latest GCC versions. Just-in-time compilation à la PyPy/LuaJIT/TraceMonkey/V8 is much better suited to giving good performance for dynamic languages (at the cost of much higher implementation effort).

  • Development Experience: Having an interpreter or JIT can also give you a much more convenient experience for developers -- generating C code, then compiling it and linking it, will certainly be slower and less convenient.

That said, I still think it's a reasonable choice to use C as a compilation target for prototyping new languages. Given that LLVM was explicitly designed as a compiler backend, I would only consider C if there are good reasons not to use LLVM. If the source language is very high-level, though, you most likely need an earlier, higher-level optimisation pass, as LLVM is indeed very low-level (e.g., GHC performs most of its interesting optimisations before handing code to LLVM). Oh, and if you're prototyping a language, using an interpreter is probably easiest -- just try to avoid features that rely too much on being implemented by an interpreter.

nominolo
  • 5,085
  • 2
  • 25
  • 31
5

Personally I would compile to C. That way you have a universal intermediary language and don't need to be concerned about whether your compiler supports every platform out there. Using LLVM might get some performance gains (although I would argue the same could probably be achieved by tweaking your C code generation to be more optimizer-friendly), but it will lock you in to only supporting targets LLVM supports, and having to wait for LLVM to add a target when you want to support something new, old, different, or obscure.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • This is a great answer if you're also concerned about portability, which is also an important consideration. – Matt Jan 23 '12 at 18:17
  • 1
    I agree - should I ever implement a language, a ridiculously portable C89 implementation will be the first item on my list, long before a clever JIT compiler using LLVM. –  Jan 23 '12 at 18:51
  • Can you name any targets that LLVM doesn't support that GCC does? – Louis Wasserman Jan 24 '12 at 04:05
  • 2
    C means C, not GCC. There are probably over 100 different C compilers in the world (at least if you only demand C89 and not C99 or C11). Combined, they support a lot more targets than GCC or LLVM could ever. – R.. GitHub STOP HELPING ICE Jan 24 '12 at 04:19
  • Targeting "general C" instead of a general compiler means you'll never actually release a finished product, but only a construction kit. At best, a puzzle at worst :) – Marco van de Voort Jan 28 '12 at 11:02
  • 1
    @MarcovandeVoort: That makes no sense. If you write C, then unless you do really stupid stuff, it automatically works on almost any compiler (barring really pathological things an implementation is allowed to do, and the whole translation limits/"at least one program" issue). – R.. GitHub STOP HELPING ICE Jan 28 '12 at 13:31
  • The point is more that there is more to creating a compiler that has a C codegenerator than generating standard C. Think parameters, build systems, runtime library issues + versioning etc etc. – Marco van de Voort Jan 28 '12 at 13:57
4

As far as I know, C can't query or manipulate processor flags.

Lemming
  • 41
  • 1
3

This answer is a rebuttal to some of the points made against C as a target language.

  1. Tail call optimizations

    Any function that can be tail call optimized is actually equivalent to an iteration (it's an iterative process, in SICP terminology). Additionally, many recursive functions can and should be made tail recursive, for performance reasons, by using accumulators etc.

    Thus, in order for your language to guarantee tail call optimization, you would have to detect it and simply not map those functions to regular C functions - but instead create iterations from them.

  2. Garbage collection

    It can be actually implemented in C. You can create a run-time system for your language which consists of some basic abstractions over the C memory model - using for example your own memory allocators, constructors, special pointers for objects in the source language, etc.

    For example instead of employing regular C pointers for the objects in the source language, a special structure could be created, over which a garbage collection algorithm could be implemented. The objects in your language (more accurately, references) - could behave just like in Java, but in C they could be represented along with meta-information (which you wouldn't have in case you were working just with pointers).

    Of course, such a system could have problems integrating with existing C tooling - depends on your implementation and trade-offs that you're willing to make.

  3. Lacking operations

    hippietrail noted that C lacks rotate operators (by which I assume he meant circular shift) that are supported by processors. If such operations are available in the instruction set, then they can be added using inline assembly.

    The frontend would in this case have to detect the architecture for which it's compiling and provide the proper snippets. Some kind of fallback in the form of a regular function should also be provided.

This answer seems to be addressing some core issues seriously. I'd like to see some more substantiation on which problems exactly are caused by C's semantics.

Community
  • 1
  • 1
corazza
  • 31,222
  • 37
  • 115
  • 186
1

There's a particular case to consider: writing a programming language with strong security* or reliability requirements.

For one, it would take you years to know a big enough subset of C well enough to be sure that all the C operations you choose to employ in your compilation are safe and don't invoke undefined behaviour. Secondly, you'd then have to find an implementation of C that you can trust (which would mean a tiny trusted code base, and probably won't be very efficient). Not to mention you'll need to find a trusted linker, an OS capable of executing compiled C code, and some basic libraries, all of which would need to be well-defined and trusted.

So in this case you might as well use either assembly language or, if you care about machine independence, some intermediate representation.

*please note that "strong security" here is not related at all to what banks and IT businesses claim to have

1

Is it a good idea to compile a language to C?

No.

...which raises one obvious question: why do some still think compiling via C is a good idea?

Two big arguments in favour of misusing C in this fashion are that it's stable and standardised.

For these and other reasons, there are various half-done, toy-like, lab-experiment, single-site/use, and otherwise ignominious via-C backends scattered throughout cyberspace; abandoned, most have succumbed to bit-rot. But there are some projects which do manage to progress to the mainstream, and their success is then used by via-C supporters to further perpetuate this fantasy.

But if you're one of those supporters, feel free to make fantasy into reality - there's that work happening in GCC, or the resurrected LLVM backend for C. Just imagine it: two well-built, well-maintained via-C backends into which the sum of all prior knowledge can be directed.

They just need you.

atravers
  • 455
  • 4
  • 8