
Here's a simple C file with an enum definition and a main function:

enum days {MON, TUE, WED, THU};

int main() {
    enum days d;
    d = WED;
    return 0;
}

It compiles to the following LLVM IR:

define dso_local i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  store i32 0, i32* %1, align 4
  store i32 2, i32* %2, align 4
  ret i32 0
}

%2 is evidently the d variable, which gets 2 assigned to it. What does %1 correspond to if zero is returned directly?

Teymour
macleginn
  • What flags did you use to produce this IR? – arrowd Jan 06 '20 at 11:34
  • @arrowd, I installed the latest stable LLVM suite and ran `clang-9 -S -emit-llvm simple.c` – macleginn Jan 06 '20 at 11:37
  • My guess is that Clang performs constant propagation even at `-O0`. – arrowd Jan 06 '20 at 11:43
  • What exactly is propagated here? The function doesn't return a variable with an assigned value. – macleginn Jan 06 '20 at 12:45
  • I think it has something to do with initialization before `main` (https://godbolt.org/z/kEtS-s). The link shows how the assembly is mapped to the source. – Pradeep Kumar Jan 08 '20 at 05:14
  • @PradeepKumar: Indeed, if you change the name of the function to something other than `main`, the mysterious extra variable disappears. Interestingly, it also disappears if you omit the `return` statement entirely (which is legal for `main` in C and equivalent to `return 0;`). – Nate Eldredge May 02 '20 at 16:38
  • The `enum` seems to be a total red herring; you can also see an unnecessary variable if `main` consists only of `return 0;` or `return 17;` (the extra variable is set to zero in either case). The extra variable ends up in the assembly, too. – Nate Eldredge May 02 '20 at 16:43
  • @NateEldredge, it occurred to me that this may simply be `argc`, which perhaps needs to be present on `main`'s stack by default, but the connection with the return statement is puzzling. – macleginn May 02 '20 at 17:12
  • @macleginn: I'm not so sure. If you declare `main` as `int main(int argc, char **argv)` you see `argc` and `argv` copied onto the stack, but the mysterious zero variable is still there in addition to them. – Nate Eldredge May 02 '20 at 17:18

3 Answers

This %1 is not a register but a stack slot (an alloca) that Clang generates to hold the return value, which lets it handle multiple return statements in a function. Imagine you were writing a function to compute an integer's factorial. Instead of this:

int factorial(int n){
    int result;
    if(n < 2)
      result = 1;
    else{
      result = n * factorial(n-1);
    }
    return result;
}

you'd probably write this:

int factorial(int n){
    if(n < 2)
      return 1;
    return n * factorial(n-1);
}

Why? Because Clang inserts that result variable for you: a slot that holds the return value until the function exits. That is what %1 is. Look at the IR for a slightly modified version of your code.

Modified code,

enum days {MON, TUE, WED, THU};

int main() {
    enum days d;
    d = WED;
    if(d) return 1;
    return 0;
}

IR,

define dso_local i32 @main() #0 !dbg !15 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  store i32 0, i32* %1, align 4
  store i32 2, i32* %2, align 4, !dbg !22
  %3 = load i32, i32* %2, align 4, !dbg !23
  %4 = icmp ne i32 %3, 0, !dbg !23
  br i1 %4, label %5, label %6, !dbg !25

5:                                                ; preds = %0
  store i32 1, i32* %1, align 4, !dbg !26
  br label %7, !dbg !26

6:                                                ; preds = %0
  store i32 0, i32* %1, align 4, !dbg !27
  br label %7, !dbg !27

7:                                                ; preds = %6, %5
  %8 = load i32, i32* %1, align 4, !dbg !28
  ret i32 %8, !dbg !28
}

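Now you can see %1 making itself useful: each return statement stores its value into %1 and branches to a single exit block, which loads %1 and returns it. In most functions with a single return statement, this variable is stripped by one of LLVM's passes.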
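
To make the shape of that IR easier to read, here is a rough hand-written C analogue of it. This is only a sketch of the pattern, not something Clang produces; the names `main_like`, `retval` and `done` are made up for illustration.

```c
/* Hand-written C analogue of the IR above: one return slot, one exit point.
   The names main_like, retval and done are hypothetical. */
enum days {MON, TUE, WED, THU};

int main_like(void) {
    int retval = 0;      /* plays the role of %1: the return slot, zeroed up front */
    enum days d = WED;   /* plays the role of %2: the local variable d */

    if (d) {
        retval = 1;      /* "return 1;" becomes a store to the slot ... */
        goto done;       /* ... followed by a branch to the shared exit */
    }
    retval = 0;          /* "return 0;" is lowered the same way */
    goto done;

done:
    return retval;       /* the single exit loads the slot and returns it */
}
```

Calling `main_like()` follows the same path as block 5 in the IR, since `WED` is nonzero. With only one `return`, the slot is never read, which is why it looks pointless in the original question.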

droptop

Why does this matter — what's the actual problem?

I think the deeper answer you're looking for might be: LLVM's architecture is based around fairly simple frontends and many passes. The frontends have to generate correct code, but it doesn't have to be good code. They can do the simplest thing that works.

In this case, Clang generates a couple of instructions that turn out not to be used for anything. That's generally not a problem, because some part of LLVM will get rid of superfluous instructions. Clang trusts that to happen. Clang doesn't need to avoid emitting dead code; its implementation may focus on correctness, simplicity, testability, etc.
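
As a rough illustration of that division of labour, the two functions below are written by hand in C to mirror the unoptimized IR from the question and what remains once the unused stores are dropped. Neither is actual compiler output, and the names `main_unoptimized` and `main_optimized` are hypothetical.

```c
/* What the front end emits, written back as C (hypothetical names). */
int main_unoptimized(void) {
    int retval;    /* corresponds to %1 */
    int d;         /* corresponds to %2 */
    retval = 0;    /* store that is never read back */
    d = 2;         /* store that is never read back */
    (void)retval;  /* only here to silence unused-variable warnings */
    (void)d;
    return 0;      /* the constant is returned directly */
}

/* What is left once the superfluous stores and slots are removed. */
int main_optimized(void) {
    return 0;
}
```

Both functions have the same observable behaviour, so the front end loses nothing by emitting the first form and leaving the cleanup to later passes.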

arnt

Because Clang is done with syntax analysis but LLVM hasn't even started with optimization.

The Clang front end has generated IR (Intermediate Representation), not machine code. Those variables are SSA (Static Single Assignment) values; they haven't been bound to registers yet and, after optimization, never will be, because they are redundant.

That code is a somewhat literal representation of the source. It is what Clang hands to LLVM for optimization. Basically, LLVM starts with that and optimizes from there. Indeed, for version 10 and x86_64, `llc -O2` will eventually generate:

main: # @main
  xor eax, eax
  ret
Olsonist
  • I understand the process on this level. I wanted to know why this IR was generated to begin with. – macleginn May 02 '20 at 17:20
  • You may be thinking of a compiler as a single pass. There is a pipeline of passes, starting with the Clang front end, which generates IR. It didn't even generate this textual IR; someone requested that with `clang -emit-llvm -S file.cpp`. Clang actually generated a binary, serializable bitcode version of the IR. LLVM is structured as multiple passes, each taking and optimizing IR. The first LLVM pass takes IR from Clang. It takes IR because you can replace Clang with the Fortran front end in order to support another language with the same optimizer and code generator. – Olsonist May 02 '20 at 18:02