7

Consider the simple code:

UINT64 result;
UINT32 high, low;
...
result = ((UINT64)high << 32) | (UINT64)low;

Do modern compilers turn that into a real barrel shift on high, or optimize it to a simple copy to the right location?

If not, then using a union would seem to be more efficient than the shift that most people appear to use. However, having the compiler optimize this is the ideal solution.

I'm wondering how I should advise people when they do require that extra little bit of performance.

timrau
Adam Davis
  • I'd advise them to try both, and time them. It's very difficult to predict the code compilers will emit, and even more difficult to predict how "efficient" it will be. – May 25 '11 at 18:05
  • I think you mean a bitwise `or` (`|`), not a logical `or` (`||`). – Fred Larson May 25 '11 at 18:11
  • It is also easy to predict that simple one line "optimizations" are already well known to the compiler writers, and will hardly ever make a difference. If you are hard pressed on performance, by all means try it, but don't hold your breath! – Bo Persson May 25 '11 at 18:13

4 Answers

4

Modern compilers are smarter than you might think ;-) (so yes, I think you can expect a barrel shift on any decent compiler).

Anyway, I would use the option whose semantics are closest to what you are actually trying to do.

fortran
4

If this is supposed to be platform independent, then the only option is to use shifts here.

With union { r64; struct{low;high}} you cannot tell which half of r64 the low/high fields will map to. Think about endianness.

Modern compilers are pretty good at handling such shifts.
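
To illustrate the endianness point, here is a minimal sketch. The names `combine_union` and `is_little_endian` are hypothetical and not from the question. The union fills its two 32-bit fields in memory order, so which one ends up as the high half of the 64-bit value depends on the host's byte order.

```c
#include <stdint.h>

/* Hypothetical helper: builds a 64-bit value through a union.
   `first` is stored at the lower address, `second` right after it. */
uint64_t combine_union(uint32_t first, uint32_t second)
{
    union {
        uint64_t full;
        struct {
            uint32_t first;
            uint32_t second;
        } p;
    } u;

    u.p.first = first;
    u.p.second = second;
    return u.full; /* which field is the high half depends on byte order */
}

/* Hypothetical helper: returns 1 on a little-endian host, 0 on big-endian. */
int is_little_endian(void)
{
    uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}
```

On a little-endian host, `first` becomes the low 32 bits. On a big-endian host it becomes the high 32 bits, so the same source gives a different value. The shift form has no such dependence because it works on values rather than on memory layout.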

c-smile
4

I wrote the following (hopefully valid) test:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

void func(uint64_t x);

int main(int argc, char **argv)
{
#ifdef UNION
  union {
    uint64_t full;
    struct {
      uint32_t low;
      uint32_t high;
    } p;
  } result;
  #define value result.full
#else
  uint64_t result;
  #define value result
#endif
  uint32_t high, low;

  if (argc < 3) return 0;

  high = atoi(argv[1]);
  low = atoi(argv[2]);

#ifdef UNION
  result.p.high = high;
  result.p.low = low;
#else
  result = ((uint64_t) high << 32) | low;
#endif

  // printf("%08x%08x\n", (uint32_t) (value >> 32), (uint32_t) (value & 0xffffffff));
  func(value);

  return 0;
}

Running a diff of the unoptimized output of gcc -S:

<   mov -4(%rbp), %eax
<   movq    %rax, %rdx
<   salq    $32, %rdx
<   mov -8(%rbp), %eax
<   orq %rdx, %rax
<   movq    %rax, -16(%rbp)
---
>   movl    -4(%rbp), %eax
>   movl    %eax, -12(%rbp)
>   movl    -8(%rbp), %eax
>   movl    %eax, -16(%rbp)

I don't know assembly, so it's hard for me to analyze that. However, it looks like some shifting is taking place as expected on the non-union (top) version.

But with -O2 optimization enabled, the output was identical. The same code was generated, so both ways will have the same performance.

(gcc version 4.5.2 on Linux/AMD64)

Partial output of optimized -O2 code with or without union:

    movq    8(%rsi), %rdi
    movl    $10, %edx
    xorl    %esi, %esi
    call    strtol

    movq    16(%rbx), %rdi
    movq    %rax, %rbp
    movl    $10, %edx
    xorl    %esi, %esi
    call    strtol

    movq    %rbp, %rdi
    mov     %eax, %eax
    salq    $32, %rdi
    orq     %rax, %rdi
    call    func

The snippet begins immediately after the jump generated by the if line.

Matthew
  • It doesn't really matter what code is generated, what matters is how fast the generated code runs - you need to time it. – May 25 '11 at 19:01
  • @Neil, correct. Let's assume a union is faster, for discussion's sake. My point is simply that if you use a union expecting faster results, it doesn't matter because the compiler has "optimized" it to the same thing as a shift. – Matthew May 25 '11 at 19:04
  • When you say the output was identical - which version did it choose for the optimized version, the one that consisted only of moves, or the one with the shift (salq) ? They both consist of 4 moves, so I would assume the optimized version would be the 4 moves alone, getting rid of the extra two operations. – Adam Davis May 25 '11 at 20:49
  • @Adam, I've included the optimized version. (The C source code was changed to just call an arbitrary external function to minimize the extra code generated by the printf parameters.) I don't know enough to tell whether the `strtol` calls are being placed directly in the final destination or not, but it looks to me like there's four operations involving a shift. Anyway, I only wrote this for curiosity's sake... Obviously this is a single trivial instance of the broader question, but it does show that the generated code /might/ be the same. Categorically, my answer would be what @fortran said. – Matthew May 25 '11 at 22:10
  • Interesting! It chose the barrel shifted version as the optimized version. That makes sense as it leaves the result in a register for further use, and the two register operations (shift and or) are both cheaper than memory moves. That makes sense now that I think about it further. Thank you for performing the work to find out! – Adam Davis May 25 '11 at 22:26
2

EDIT: This response is based on an earlier version of the OP's code that did not have a cast

This code

result = (high << 32) | low;

is actually going to have undefined results. Since high is a 32-bit value and you are shifting it by 32 bits (the full width of the type), the behavior is undefined in C, and the outcome depends on the compiler and target platform. That undefined result is then OR'd with low, so the end result will most likely not be the 64-bit value you want. For instance, the code emitted by gcc -S on OSX 10.6 looks like:

movl    -4(%rbp), %eax      //retrieving the value of "high"
movl    $32, %ecx          
shal    %cl, %eax           //performing the 32-bit shift on "high"
orl    -8(%rbp), %eax       //OR'ing the value of "low" to the shift op result

So you can see that the shift is only taking place on a 32-bit value in a 32-bit register with a 32-bit assembly instruction. The result ends up being exactly the same as high | low without any shifting at all, because in this case shal $32, %eax just returns the value that was originally in EAX: x86 masks the shift count for a 32-bit operand to its low 5 bits, so a count of 32 behaves like a count of 0. You're not getting a 64-bit result.

To avoid that, cast high to uint64_t before shifting:

result = ((uint64_t)high << 32) | low;
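
As a small sketch of where the cast has to go, the line can be wrapped in a helper. The name `make_u64` is hypothetical and used only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: widen `high` to 64 bits first, then shift. */
uint64_t make_u64(uint32_t high, uint32_t low)
{
    /* The cast must apply to `high` itself, not to the shifted result:
       (uint64_t)(high << 32) would still shift a 32-bit value by 32. */
    return ((uint64_t)high << 32) | low;
}
```

`low` needs no explicit cast here: it is converted to a 64-bit unsigned value by the usual arithmetic conversions before the OR, and because it is unsigned it is zero-extended.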
Jason
  • Thanks for the info ... adjusting my answer above – Jason May 25 '11 at 20:09
  • Yes, I removed the casting for brevity. Due to the number of comments on how it needs to be properly cast I see I should have left it in. – Adam Davis May 25 '11 at 20:50
  • No problem ... I'll be more than happy to delete this post if you add back in the cast, otherwise I think it will be good to leave it so that someone doesn't try the code above as-is right now and then scratch their heads wondering why it doesn't quite work as expected. BTW, I did give you a +1 since I think overall this is a great question (and I learned something myself :-) – Jason May 25 '11 at 20:57
  • Go ahead and leave your answer, even though I have added the casting back in. – Adam Davis May 25 '11 at 21:07