I'm trying to optimize some code that's supposed to read single precision floats from memory and perform arithmetic on them in double precision. This is becoming a significant performance bottleneck, as the code that stores data in memory as single precision is substantially slower than equivalent code that stores data in memory as double precision. Below is a toy C++ program that captures the essence of my issue:
#include <cstdio>

// noinline to force main() to actually read the value from memory.
__attribute__ ((noinline)) float* GetFloat() {
    float* f = new float;
    *f = 3.14f;
    return f;
}

int main() {
    float* f = GetFloat();
    double d = *f;
    printf("%f\n", d); // Use the value so it isn't optimized out of existence.
}
Both GCC and Clang perform the load of *f and the conversion to double precision as two separate instructions, even though the cvtss2sd instruction supports a memory operand as its source argument. According to Agner Fog's instruction tables, cvtss2sd r, m executes as fast as movss r, m on most architectures, and avoids the need to execute cvtss2sd r, r afterward. Nonetheless, Clang generates the following code for main():
main PROC
        push     rbp
        mov      rbp, rsp
        call     _Z8GetFloatv
        movss    xmm0, dword ptr [rax]
        cvtss2sd xmm0, xmm0
        mov      edi, offset ?_001
        mov      al, 1
        call     printf
        xor      eax, eax
        pop      rbp
        ret
main ENDP
GCC generates similarly inefficient code. Why doesn't either compiler simply generate something like cvtss2sd xmm0, dword ptr [rax]?
EDIT: Great answer, Stephen Canon! I took Clang's assembly output for my real use case, pasted it into a source file as inline asm, benchmarked it, then made the changes discussed here and benchmarked it again. I couldn't believe that cvtss2sd with a memory operand is actually slower.