
I know that a compiler can have many front-ends. Each front-end translates code written in a programming language into an internal data structure.

Then inside that data structure the compiler makes some optimizations.

Then the back-end of the compiler translates that data structure into assembly code, and in the assembly phase that assembly code is translated into object code.

My question is the following.

Considering the fact that any programming language is translated into that internal data structure, is the final code outputted by the compiler the same for the same program logic but for DIFFERENT programming languages?

  • It's not necessarily the same for the same logic even in the same language ;) But it may be. – Jester Sep 22 '17 at 14:18
  • @jester: i think that is the only plausible answer to this question. – rici Sep 22 '17 at 14:32
  • It's like with different communities: even if they speak the same language (assembly) they may not be able to cooperate seamlessly due to different dialects (name mangling), infrastructures (calling conventions) or cultures (programming language concepts like pointers). It's possible, just not always immediate. – Margaret Bloom Sep 22 '17 at 14:35
  • @MargaretBloom Thank you for the analogy. interesting point. –  Sep 22 '17 at 14:44
  • @MargaretBloom: good point about name mangling and other stuff. I was only thinking about the actual instructions emitted for a case where it might be reasonable to *hope* that you'd get the same output. (i.e. not with different struct layouts). – Peter Cordes Sep 22 '17 at 14:50

2 Answers


Yes, that's possible. But subtle differences between languages can result in different asm from similar-looking source. It's rare that the front-end will give the back-end exactly the same inputs. The result may end up optimized the same way for simple functions, and the compiler will generally use the same kinds of strategies (e.g. on x86, how many LEA instructions it's worth using instead of a multiply).

e.g. in C, signed overflow is undefined behaviour, so

void foo(int *p, int n) {
    for (int i = 0; i <= n; i++) {
        p[i] = i / 4;
    }
}

can be assumed to terminate eventually for all possible n (including INT_MAX), and for i to be non-negative.

With a front-end for a language where i++ is defined to have 2's complement wrap-around (or gcc with -fwrapv -fno-strict-overflow), i would go from ==INT_MAX to a large negative, always <= INT_MAX. The compiler would be required to make asm that faithfully implements the source code's behaviour even for callers that pass n == INT_MAX, making this an infinite loop where i can be negative.
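
To make the wrap-around case concrete, here is a minimal C sketch of what "i++ with 2's complement wrap-around" means, computed in unsigned arithmetic so the sketch itself contains no signed overflow. The helper names wrapping_inc and check_wrapping_inc are hypothetical (not from gcc or any library), and the sketch assumes the usual layout where UINT_MAX == 2 * INT_MAX + 1.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: increment with two's complement wrap-around.
   The addition is done on unsigned values, where overflow is well defined.
   Assumes UINT_MAX == 2 * INT_MAX + 1 (true on mainstream platforms). */
static int wrapping_inc(int i) {
    unsigned u = (unsigned)i + 1u;       /* wraps modulo UINT_MAX + 1 */
    if (u <= (unsigned)INT_MAX)
        return (int)u;                   /* value fits in int as-is */
    return -(int)(UINT_MAX - u) - 1;     /* map the upper half to negatives */
}

/* Self-check: after INT_MAX the counter becomes INT_MIN, so a condition
   like i <= INT_MAX can never become false. */
static void check_wrapping_inc(void) {
    assert(wrapping_inc(0) == 1);
    assert(wrapping_inc(-1) == 0);
    assert(wrapping_inc(INT_MAX) == INT_MIN);
    assert(wrapping_inc(INT_MAX) <= INT_MAX);
}
```

With these semantics the loop condition i <= n stays true forever when n == INT_MAX, which is exactly the case the compiler must handle when wrapping is defined.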

But since that's Undefined Behaviour in C and C++, the compiler can assume the program doesn't contain any UB, and thus that no caller can pass INT_MAX. It can assume that i is never negative inside the loop, and that the loop trip-count fits in an int. See also What Every C Programmer Should Know About Undefined Behavior (clang blog).


The non-negative assumption lets it implement i / 4 with a simple right-shift, rather than implementing C integer division semantics for negative numbers.

# the p[i] = i/4;  part of the inner loop from
# gcc -O3  -fno-tree-vectorize
    mov     edx, eax                        # copy the loop counter
    sar     edx, 2                          # i / 4 == i>>2
    mov     DWORD PTR [rdi+rax*4], edx      # store into the array

Source + asm output on the Godbolt compiler explorer.

But if signed wrap-around is defined behaviour, signed division by a constant takes more instructions, and array indexing has to account for the possible wrapping:

# Again *just* the body of the inner loop, without the loop overhead
# gcc -fno-strict-overflow -fwrapv    -O3 -fno-tree-vectorize
    test    eax, eax           # set flags (including SF) according to i
    lea     edx, [rax+3]       # edx = i+3
    movsx   rcx, eax           # sign-extend for use in the addressing mode
    cmovns  edx, eax           # copy if !signbit_set(i)
    sar     edx, 2             # i/4 = i>=0 ? i>>2 : (i+3)>>2;
    mov     DWORD PTR [rdi+rcx*4], edx
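
The two asm sequences above differ because C's i / 4 truncates toward zero, while an arithmetic right shift rounds toward negative infinity. The following sketch mirrors the lea / cmovns / sar sequence in C. The function names are hypothetical, and it assumes that >> on a negative int is an arithmetic shift, which is implementation-defined in C but is what gcc and clang do.

```c
#include <assert.h>
#include <limits.h>

/* Mirrors the -fwrapv asm: bias negative values by 3, then shift.
   Assumes >> on a negative int is an arithmetic shift (gcc, clang). */
static int div4_biased(int i) {
    int biased = (i >= 0) ? i : i + 3;   /* the lea + cmovns pair */
    return biased >> 2;                  /* the sar */
}

/* Mirrors the plain sar: only equal to i / 4 when i is non-negative. */
static int div4_plain_shift(int i) {
    return i >> 2;
}

/* Self-check against C's truncating division. */
static void check_div4(void) {
    for (int i = -1000; i <= 1000; i++)
        assert(div4_biased(i) == i / 4);
    assert(div4_biased(INT_MIN) == INT_MIN / 4);
    assert(-7 / 4 == -1);                /* C division truncates toward zero */
    assert(div4_plain_shift(-7) == -2);  /* the shift rounds down instead */
}
```

For non-negative i the two functions agree, which is why the compiler can drop the bias once it has proven i >= 0.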

C array-indexing syntax is just sugar for pointer + integer, and doesn't require that the index is non-negative. So it's valid for the caller to pass a pointer to the middle of a 16GB array (2^32 possible int indices times 4 bytes each), all of which this function must eventually write. (Infinite loops are questionable, too, but never mind that.)

As you can see, a tiny difference in language rules prevented the compiler from making these optimizations. Differences between the rules of two languages are usually larger than the difference between ISO C++ and the defined-signed-wraparound flavour of C++ that g++ can implement.

Also, if the "usual" types are different widths or signedness in another language, it's very likely that the back-end will get different input, and in some cases that will matter.

If I had used unsigned, wraparound would be the defined overflow behaviour in C and C++. But unsigned types by definition are non-negative, so the possibility of wraparound wouldn't have such an obvious effect on optimizations without unrolling. If the loop had started from greater than zero, then wraparound introduces the possibility of coming back to 0, in case that matters (e.g. x / i is a division by zero).
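
As a small illustration of the unsigned case, the sketch below counts how many increments it takes for an unsigned counter to come back around to 0. The function name steps_until_zero and its max_steps parameter are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: number of increments until an unsigned counter
   wraps to 0, or 0 if that does not happen within max_steps. */
static unsigned steps_until_zero(unsigned start, unsigned max_steps) {
    unsigned i = start;
    for (unsigned steps = 1; steps <= max_steps; steps++) {
        i++;                 /* well defined: wraps modulo UINT_MAX + 1 */
        if (i == 0)
            return steps;    /* a later x / i would divide by zero here */
    }
    return 0;                /* did not reach zero within max_steps */
}

/* Self-check: starting just below UINT_MAX, zero is reached quickly. */
static void check_steps_until_zero(void) {
    assert(steps_until_zero(UINT_MAX, 10) == 1);
    assert(steps_until_zero(UINT_MAX - 2u, 10) == 3);
    assert(steps_until_zero(1u, 10) == 0);
}
```

Because this wrap is defined behaviour, a compiler cannot rule out i == 0 inside such a loop unless it can bound the trip count.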

Peter Cordes

Yes, it is possible that code compiled in different languages results in the same final assembly.

Same or Similar Code

For example, if the front-ends for two different languages produce the same intermediate code and metadata[1], and the same optimization phases are applied, then the back-end can be expected to produce the same code. This is easy to see for closely related languages such as C and C++, where the same or similar source often produces identical assembly.

Here's a trivial example using C code to increment an int through a pointer and C++ code to increment an int through a reference.

Increment in C

Source

void inc(int* p) {
    (*p)++;
}

Final Assembly

In gcc at -O2

inc:
        add     DWORD PTR [rdi], 1
        ret

Play with the assembly here yourself in gcc and clang.

Increment in C++

Similar code, but use the C++ reference feature rather than passing a pointer.

Source

void inc(int& p) { p++; }

Assembly

In g++ with -O2

inc(int&):
        add     DWORD PTR [rdi], 1
        ret

Play with it here on godbolt.

The assembly produced in both cases was identical (apart from the symbol name), despite using different languages and different language features (references in the case of C++, which aren't available in C).

Note also that clang, a completely separate toolchain, produced different code than gcc (using inc rather than add), but the code it produced was again consistent between C and C++.

Different Code

More interestingly, even wildly different code in different languages may produce the same final assembly. Even if the front-ends produce very different intermediate code, optimization passes may eventually reduce both inputs to the same output. It's certainly not guaranteed for any particular input though, and it will vary a lot by compiler and platform.


1 By metadata, I mean anything aside from the intermediate instructions themselves that might affect code generation. For example, some languages may allow fewer optimizations such as memory re-ordering, or have other behaviors that vary (Peter points out signed overflow). It isn't clear to me whether all of these differences are encoded directly in the intermediate language, or whether there is also some metadata associated with each chunk of intermediate code that describes specific semantics the optimization phases and back-end must respect.

Peter Cordes
BeeOnRope
  • I'd guess that different compilers have different ways of passing language rules to the backend. Some maybe with explicit stuff in a generic metadata format, others with switches/modes for the back-end. BTW, if you want to get fancier, Matt Godbolt's compiler-explorer site has other languages, including Rust and D which are even more different than C and C++. – Peter Cordes Sep 22 '17 at 19:38
  • Huh, I didn't include any other languages because "godbolt" doesn't support them but I hadn't noticed new langs had been added! Thanks @PeterCordes – BeeOnRope Sep 22 '17 at 21:27