How does LLVM avoid generating redundant native code for the `br` IR instruction?

Question

For the following C code

void foo() {
    int forty_two = 42;
    if (forty_two == 42) {
    }
}

clang -S -emit-llvm foo.c emits this IR code:

define dso_local void @foo() #0 {
  %1 = alloca i32, align 4
  store i32 42, i32* %1, align 4
  %2 = load i32, i32* %1, align 4
  %3 = icmp eq i32 %2, 42
  br i1 %3, label %4, label %5

4:                                                ; preds = %0
  br label %5

5:                                                ; preds = %4, %0
  ret void
}

And for the IR, LLVM (llc foo.ll) generates the x64 code below. It's abridged for readability.

foo:                                    # @foo
# %bb.0:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $42, -4(%rbp)
    cmpl    $42, -4(%rbp)
    jne .LBB0_2
# %bb.1:
    jmp .LBB0_2
.LBB0_2:
    popq    %rbp
    retq

In contrast to the native code emitted by LLVM, a straightforward translation of the IR would contain a number of redundant instructions. Something along these lines:

foo:
# %bb.0:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $42, -4(%rbp)
    cmpl    $42, -4(%rbp)

# create the i1 boolean in a register.
# (This instruction is redundant and LLVM doesn't emit it)
    sete    %al
# See whether the comparison's result was `true` or `false`.
# (This instruction is redundant and LLVM doesn't emit it)
    cmpb    $1, %al

    jne .LBB0_2
# %bb.1:
    jmp .LBB0_2
.LBB0_2:
    popq    %rbp
    retq

My question is: Where is the portion of LLVM code that makes sure these redundant instructions are not emitted? And how does it work?

I read the excellent post "Life of an instruction in LLVM" by @Eli Bendersky and looked at the code in SelectionDAGBuilder::visitICmp and SelectionDAGBuilder::visitBr, but I didn't manage to figure out the answer on my own.

If LLVM was going to materialize a boolean compare result and test it, it might only be 8-bit width, like `sete %al` / `test %al, %al`+`jz`. (Are you sure it would compare for equality to 1, not just being non-zero?) Not that that's relevant to your actual question; I don't know LLVM internals so IDK how it turns that IR into only creating the condition in FLAGS. — Peter Cordes, Jun 09 '21 at 01:10
@PeterCordes, you are right, that comparison to 1 is just off the top of my head to illustrate the question. LLVM would probably materialize the comparison as `test %al, %al` — Myk, Jun 09 '21 at 08:31
@PeterCordes I'm thinking about editing the assembly to replace `cmpl $1, %eax` with `test %al, %al` but a nice property of `cmpl` is it lets me keep the portions of asm actually generated by LLVM intact. — Myk, Jun 09 '21 at 08:33
Sure, that's reasonable. You could drop the zero-extension; the LLVM-IR is using `i1` (a 1-bit type). Also the AND: LLVM must know that `setcc` produces a 0/1, or at least setcc is a clear way to show materializing an `i1` anyway. I went ahead and made the edit. — Peter Cordes, Jun 09 '21 at 08:34
(Fun fact: `cmp $1, %al` is 2 bytes, and just as efficient as `test %al,%al` in this case, because of the special case AL,imm8 encoding with no ModRM.) — Peter Cordes, Jun 09 '21 at 08:39
Hi @PeterCordes So, I was revisiting this question, for no particular reason, really. And I'd just like to take a moment to thank you for dedicating your time to help people like me with their questions. I feel really lucky, that there is a place I can ask a question and get help from someone insanely knowledgeable like yourself, paxdiablo, arnt and others. — Myk, Jun 13 '23 at 21:56

score 1 · Answer 1 · answered Jun 09 '21 at 05:23

LLVM runs passes that change the code in beneficial ways. Each pass decides what "beneficial" means. Am I right in assuming that you're more interested in a generic answer and are using that br as an example? If so, the -print-after-all flag, which instructs the compiler to print the IR after each of the passes, may be what you want. There's also -print-before-all, plus more specific flags.

Reading the output and seeing how it changes gives you a good overview of which passes add or eliminate which warts.

arnt

https://godbolt.org/z/5Ga56j4oP shows `llc --print-after-all -O0` output for the source in the question. The last pure LLVM-IR pass still has `%3 = icmp eq i32 %2, 42` and `br i1 %3, label %4, label %5` - they only merge into a `cmp/jcc` (only materializing `%3` in EFLAGS) during translation to x86 instructions. Which makes some sense - even in debug builds, you don't want the compiler to materialize booleans in integer regs and then `test` them. That would make the asm really noisy and bloated. – Peter Cordes Jun 09 '21 at 05:37
Thank you, it's a good answer with useful info. But I'm looking specifically for the way LLVM "optimizes" `icmp` + `br` pairs. What makes the question interesting is that, afaict, there isn't much to optimize in a pair of `icmp` + `br` in terms of IR. And in terms of asm, `X86PassConfig` has some x86-specific optimization passes, but not the one the question is asking about. Plus, most likely, the technique isn't x86-specific. – Myk Jun 09 '21 at 08:35
I see; sorry for the misunderstanding. I think I've seen this be optimised, though — some pass turned suboptimal icmp/br/phi code into a single [select](https://llvm.org/docs/LangRef.html#select-instruction). – arnt Jun 09 '21 at 08:44
Oh, lots of passes call SelectInst::Create(). I think I'll do some useful work instead of looking for which one it may have been and the conditions under which that may happen. Sigh. – arnt Jun 09 '21 at 08:56
LLVM backends run their own CFG optimisation passes, which do exactly this (e.g., cmp and br fusion). It's not possible to do it before. Same with the out-of-SSA, it only makes sense in the backend, when combined with register allocation. – SK-logic Jun 11 '21 at 06:52
Turns out I should've paid more attention to the `-print-after-all` flag that you mention. Not sure why I thought it'd show only IR-related passes. It shows much more, and eventually helped me find the spot in the code I was looking for. Thank you! – Myk Jun 16 '21 at 23:58
Hi, @arnt I was revisiting this question. For no particular reason, really. I re-read your answer. I'd just like to take a moment to thank you for your answer, telling me about the very useful `-print-after-all` flag and posting a link to that article outlining the process of function optimization in LLVM. And, you know, also for in general dedicating your time to helping people like me with their questions :) – Myk Jun 13 '23 at 21:46

Myk · Accepted Answer · 2021-06-17T00:06:21.703

TLDR: X86FastISel::X86SelectBranch

Someone on LLVM's discord told me about the -print-after-all flag of llc. (Actually, @arnt mentioned it in their answer even before I asked on discord and I have no idea why I didn't give the flag a try right away...)

That flag let me see that "X86 DAG->DAG Instruction Selection" was the first pass that didn't just transform IR but turned it into x86-specific Machine IR (MIR). (The corresponding class is X86DAGToDAGISel.)
And from the MIR it emitted, it was clear that the decision whether to emit SETCC/TEST instructions is made during that pass.

Stepping through X86DAGToDAGISel::runOnMachineFunction eventually brought me to X86FastISel::X86SelectBranch. There:

If br is the only user of icmp's result and the two instructions are in the same basic block, the pass does not emit SETCC/TEST; the branch consumes the flags set by the compare directly.
If icmp's result has other users, or the two IR instructions aren't in the same basic block, the pass does emit SETCC/TEST.
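
To make the two cases concrete, here is a minimal toy model of that decision in C. It is only a sketch: `ToyInst`, `can_fold_cmp_into_branch` and `select_branch` are hypothetical names made up for illustration, not LLVM's API, and the real code works on LLVM's own instruction and basic-block classes.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for an IR instruction. */
typedef struct {
    int block_id;  /* basic block that contains the instruction */
    int num_uses;  /* number of instructions that consume its result */
} ToyInst;

/* True when the compare result can stay in FLAGS only: the branch is
   its sole user and both instructions sit in the same basic block. */
static bool can_fold_cmp_into_branch(const ToyInst *cmp, const ToyInst *br) {
    return cmp->num_uses == 1 && cmp->block_id == br->block_id;
}

/* Describes the instruction sequence this toy selector would pick. */
static const char *select_branch(const ToyInst *cmp, const ToyInst *br) {
    return can_fold_cmp_into_branch(cmp, br)
        ? "cmp; jcc"                 /* boolean never materialized */
        : "cmp; setcc; test; jcc";   /* boolean materialized in a register */
}
```

With a compare that has one use in the branch's block, the sketch picks the short `cmp; jcc` form; give the compare a second use, or put it in another block, and it falls back to the longer sequence.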

So, for this C code:

void foo() {
    int forty_two = 42;
    int is_forty_two;
    if (is_forty_two = (forty_two == 42)) {
    }
}

clang -S -emit-llvm brcond.c produces the following IR:

define void @foo() #0 {
entry:
  %forty_two = alloca i32, align 4
  %is_forty_two = alloca i32, align 4
  store i32 42, i32* %forty_two, align 4
  %0 = load i32, i32* %forty_two, align 4
  %cmp = icmp eq i32 %0, 42
  %conv = zext i1 %cmp to i32
  store i32 %conv, i32* %is_forty_two, align 4
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret void
}

Obviously, %cmp has more than one user. So llc brcond.ll emits the assembly below (abridged a bit):

foo:                                    # @foo
# %bb.0:                                # %entry
    pushq   %rax
    movl    $42, (%rsp)
    cmpl    $42, (%rsp)
    sete    %al
    movb    %al, %cl
    andb    $1, %cl
    movzbl  %cl, %ecx
    movl    %ecx, 4(%rsp)
    testb   $1, %al
    jne .LBB1_1
    jmp .LBB1_2
.LBB1_1:                                # %if.then
    jmp .LBB1_2
.LBB1_2:                                # %if.end
    popq    %rax
    retq

I only mentioned `--print-after-all` after arnt posted that as an answer! I didn't know about it before that. — Peter Cordes, Jun 16 '21 at 22:56
@PeterCordes, you're right. I'm going to edit my answer to mention it was arnt who brought up the flag. Somehow I thought the flag was only for IR-related passes, whereas it actually makes llc log all passes, afaict. — Myk, Jun 17 '21 at 00:02