I'm trying to optimize some code that's supposed to read single precision floats from memory and perform arithmetic on them in double precision. This is becoming a significant performance bottleneck, as the code that stores data in memory as single precision is substantially slower than equivalent code that stores data in memory as double precision. Below is a toy C++ program that captures the essence of my issue:
#include <cstdio>

// noinline to force main() to actually read the value from memory.
__attribute__ ((noinline)) float* GetFloat() {
    float* f = new float;
    *f = 3.14f;
    return f;
}

int main() {
    float* f = GetFloat();
    double d = *f;
    printf("%f\n", d); // Use the value so it isn't optimized out of existence.
}
Both GCC and Clang perform the load of *f and the conversion to double precision as two separate instructions, even though the cvtss2sd instruction supports a memory operand as its source argument. According to Agner Fog's instruction tables, cvtss2sd r, m executes as fast as movss r, m on most architectures, and avoids the need to execute cvtss2sd r, r afterward. Nonetheless, Clang generates the following code for main():
main PROC
        push     rbp
        mov      rbp, rsp
        call     _Z8GetFloatv
        movss    xmm0, dword ptr [rax]
        cvtss2sd xmm0, xmm0
        mov      edi, offset ?_001
        mov      al, 1
        call     printf
        xor      eax, eax
        pop      rbp
        ret
main ENDP
GCC generates similarly inefficient code. Why doesn't either compiler simply generate something like cvtss2sd xmm0, dword ptr [rax]?
EDIT: Great answer, Stephen Canon! I took Clang's assembly output for my real use case, pasted it into a source file as inline asm, benchmarked it, then made the changes discussed here and benchmarked it again. I couldn't believe that cvtss2sd with a memory operand is actually slower.