While playing around with godbolt.org, I noticed that gcc (6.2, 7.0 snapshot), clang (3.9) and icc (17), when compiling something close to

int a(int* a, int* b) {
  if (b - a < 2) return *a = ~*a;

  // register intensive code here e.g. sorting network
}

compile this (at -O2/-O3) into something like this:

    push    r15
    mov     rax, rcx
    push    r14
    sub     rax, rdx
    push    r13
    push    r12
    push    rbp
    push    rbx
    sub     rsp, 184
    mov     QWORD PTR [rsp], rdx
    cmp     rax, 7
    jg      .L95
    not     DWORD PTR [rdx]
 .L162:
    add     rsp, 184
    pop     rbx
    pop     rbp
    pop     r12
    pop     r13
    pop     r14
    pop     r15
    ret

which obviously has a huge overhead when b - a < 2. With -Os, gcc compiles it to:

    mov     rax, rcx
    sub     rax, rdx
    cmp     rax, 7
    jg      .L74
    not     DWORD PTR [rdx]
    ret
.L74:

This leads me to believe that nothing in the code keeps the compiler from emitting this shorter version.

Is there a reason why compilers do this? Is there a way to get them to compile to the shorter version without optimizing for size?


Here's an example on Godbolt that reproduces this. It seems to have something to do with the complex part being recursive.

fuz
Christoph Diegelmann
  • BTW: x86-64 clang 3.9.0 compiles to a very short version with -O1 on Godbolt, which is somewhat contradictory to what you write in your question. You should not write _when compiling something close to_ but _when compiling_ and submit the _actual_ code or a [MCVE]. – Jabberwocky Oct 20 '16 at 08:58
  • This is a more or less known weakness of current compilers. It would be much better to do the early-out test before saving all the registers the whole function will need, so they don't need to be popped. One workaround is to pull the early-out test into a wrapper function. That's especially useful if the wrapper can inline. – Peter Cordes Oct 20 '16 at 08:58
  • You might get what you want with `if (__builtin_expect(b - a < 2, 1))`. Also making the function available for inlining (or using whole program optimization) might allow GCC to partially inline the if statement. – Ross Ridge Oct 20 '16 at 09:01
  • It would be good if you could cook up an actual example that compiles, and link that on Godbolt (with a full-link to prevent link-rot from url shortening). What you've shown in the question is good as an example, though. @MichaelWalz: You see code like this all the time if you look at compiler output on real code. Of course you get a trivial function if you leave out the `//... complicated-code-here` part. – Peter Cordes Oct 20 '16 at 09:02
  • @MichaelWalz the actual code is around 800 lines and not publicly available currently. I'm trying to make a minimum code reproducing the problem but it's quite hard. – Christoph Diegelmann Oct 20 '16 at 09:02
  • @Christoph: just look for a totally different function in something open source that has an early-out at the top. BTW, this bloating of the fast-path is one use-case for `__attribute__((noinline))` in the Linux kernel: put the register-intensive general case with error handling in another function and prevent it from inlining, so there aren't push/pops on the fast path. (Where the fast path through the function is pretty short, like your early-out.) – Peter Cordes Oct 20 '16 at 09:05
  • @Cordes It seems that this is related to recursion. I've made available a short piece of code on godbolt. – Christoph Diegelmann Oct 20 '16 at 09:07
  • If it's when it recurses that it's most likely for the if statement to be true, then splitting the function like Peter Cordes suggests should work well even if the wrapper is only inlined into the register intensive function. – Ross Ridge Oct 20 '16 at 09:13
  • @Christoph: `@cordes` doesn't notify me. I think it only matches on username prefixes, not within usernames. BTW, [here's why you should always post full links, not godbolt short-links, if you have room (i.e. not a comment)](http://meta.stackoverflow.com/a/319594/224132). This is why I specifically suggested posting a full link. – Peter Cordes Oct 20 '16 at 09:48

1 Answer

This is a known compiler limitation, see my comments on the question. IDK why it exists; maybe it's hard for compilers to decide what they can do without spilling when they haven't finished saving regs yet.

Pulling the early-out check into a wrapper is often useful when it's small enough to inline.


Looks like modern gcc can actually sidestep this compiler limitation sometimes.

Using your example on the Godbolt compiler explorer, adding a second caller is enough to get even gcc6.1 -O2 to split the function for you, so it can inline the early-out into the second caller and into the externally visible square() (which ends with jmp square(int*, int*) [clone .part.3] if the early-out return path isn't taken).

Code on Godbolt. Note that I added -std=gnu++14, which is required for clang to compile your code.

void square_inlinewrapper(int* a, int* b) {
  //if (b - a < 16) return;  // gcc inlines this part for us, and calls a private clone of the function!

  return square(a, b);
}

# gcc6.1 -O2  (default / generic -march= and -mtune=)
    mov     rax, rsi
    sub     rax, rdi
    cmp     rax, 63
    jg      .L9
    rep ret
.L9:
    jmp     square(int*, int*) [clone .part.3]

square() itself compiles to the same thing, calling the private clone which has the bulk of the code. The recursive calls from inside the clone call the wrapper function, so they don't do the extra push/pop work when it's not needed.


Even gcc7 doesn't do this when there's no other caller, even at -O3. It does still transform one of the recursive calls into a loop, but the other one just calls the big function again.


Clang 3.9 and icc17 don't clone the function, either, so you should write the inlineable wrapper manually (and change the main body of the function to use it for recursive calls, if the check is needed there).

You might want to name the wrapper square, and rename just the main body to a private name (like static void square_impl).
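A minimal sketch of that manual split, in case it helps. The body of `square_impl` here is a made-up placeholder (it just squares 16 elements and recurses on the rest), not the code from the question; only the names `square` / `square_impl` and the `b - a < 16` threshold come from the discussion above. `__attribute__((noinline))` is the GCC/Clang spelling and is an assumption about your toolchain.

```cpp
// Register-intensive general case, kept out of line so its prologue/epilogue
// (push/pop of callee-saved registers, stack reservation) is only paid when
// the early-out is not taken. noinline is a GCC/Clang extension.
__attribute__((noinline))
static void square_impl(int* a, int* b);

// Small wrapper: only the cheap early-out check lives here, so the compiler
// can inline it into callers, including the recursive call sites below.
void square(int* a, int* b) {
    if (b - a < 16) return;   // same threshold as the commented-out check above
    square_impl(a, b);
}

static void square_impl(int* a, int* b) {
    // Hypothetical placeholder for the real work: square the first 16 elements.
    // The wrapper guarantees b - a >= 16, so a + 16 stays inside the range.
    for (int* p = a; p != a + 16; ++p) {
        *p *= *p;
    }
    // Recurse through the wrapper, so a short remainder returns
    // without entering square_impl again.
    square(a + 16, b);
}
```

Callers keep calling `square(a, b)` as before. Whether the wrapper actually gets inlined at each call site is still up to the compiler, so it's worth checking the asm on Godbolt for your real function.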

Peter Cordes
  • Do you know if someone gathered such known compiler limitations? – Christoph Diegelmann Oct 20 '16 at 13:36
  • @Christoph: I don't know of a list anywhere. It would be really hard to keep anything like that up to date, since every entry would basically be a missed-optimization bug that could be fixed at any time for one compiler but not others. You can search gcc's bugzilla for [missed-optimization bugs](https://gcc.gnu.org/bugzilla/buglist.cgi?cf_known_to_fail_type=allwords&cf_known_to_work_type=allwords&keywords=missed-optimization%2C%20&keywords_type=allwords&list_id=162423&product=gcc&query_format=advanced&resolution=---), and get a list that hits the limit of 500 search results... – Peter Cordes Oct 21 '16 at 03:04