I'm working on an audio library in which the sine of a number needs to be calculated within a very tight loop. Various levels of inaccuracy in the results might be tolerable for the user depending on their goals and environment, so I'm providing the ability to pick between a few sine approximations with differing accuracy and speed characteristics. One of these shows as ~31% faster than glibc's `sin()` when running under callgrind, but ~2% slower when running outside of it if the library is compiled with `-O3`, and ~25% slower if compiled with `-Ofast`. Should I trust callgrind or the "native" results, in terms of designing the library's interface?
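
To give a concrete picture, the tight loop I'm timing looks roughly like this (a simplified sketch rather than the real harness; `sin_approx` is a stand-in for one of the library's approximations, and the iteration count and phase increment are arbitrary):

```c
#include <math.h>
#include <stdio.h>
#include <time.h>

/* Stand-in for one of the library's sine approximations. */
static inline double sin_approx(double x) { return sin(x); }

int main(void)
{
    const long iters = 100000000L;                /* enough work to dwarf timer overhead */
    const double two_pi = 6.283185307179586;
    const double step = two_pi / 48000.0;         /* e.g. a 1 Hz oscillator at 48 kHz */
    double phase = 0.0, acc = 0.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        acc += sin_approx(phase);                 /* swap in sin() for the glibc baseline */
        phase += step;
        if (phase > two_pi)
            phase -= two_pi;                      /* argument stays <= 2π, as in the library */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%f s (checksum %f)\n", secs, acc);    /* print acc so the loop isn't optimized away */
    return 0;
}
```

In both cases the comparison is between running natively and under `valgrind --tool=callgrind`, with the library built at the optimization levels mentioned above.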
My gut instinct is to distrust callgrind and go with the wall-clock results, because that's what really matters in the end anyway. However, I'm worried that what I'm seeing is caused by something particular about my processor (i7-7700k), compiler (gcc 10.2.0), or other aspects of my environment (Arch Linux, kernel 5.9.13) that might not carry over to other users. Is there any chance that callgrind is showing me something "generally true", even if it's not quite true for me specifically?
The relative performance differences of the in-library sine implementations stay the same inside and outside of callgrind; only the apparent performance of glibc's `sin()` differs. These patterns hold with variable amounts of work and across repeated runs. Interestingly, with `-O1` the relative performance differences are comparable inside and outside of callgrind, but not with `-O0`, `-O2`, `-O3`, or `-Ofast`.
The input to glibc's `sin()` is in many ways a good case for it: it's a `double` that is always <= 2π, and is never subnormal, NaN, or infinite. This makes me wonder whether glibc's `sin()` might be calling my CPU's `fsin` instruction some of the time, as Intel's documentation says it's reasonably accurate for arguments below ~3π/4 (see the Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 1, p. 8-22). If that is the case, it seems possible that the Valgrind VM would have significantly different performance characteristics for that instruction, since in theory less attention might be paid to it during development than to more frequently used instructions. However, I've read the C source for the current Linux x86-64 implementation of `sin()` in glibc and I don't remember anything like that, nor do I see it in the callgrind disassembly (it seems to be doing its work "manually" using general-purpose AVX instructions). I've heard that glibc used `fsin` years ago, but my understanding is that they stopped because of its accuracy issues.
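
If I did want to test the `fsin` theory directly, the comparison point would be something like the following wrapper (a sketch assuming GCC-style extended asm on x86; the constraint puts the argument on the x87 stack and reads the result back from the same register):

```c
#include <stdio.h>

/* Invoke the x87 fsin instruction directly (GCC/Clang extended asm, x86 only).
 * "t" means the top of the x87 register stack; the argument goes in st(0)
 * and the result comes back out of it. */
static inline double fsin_raw(double x)
{
    double result;
    __asm__ ("fsin" : "=t" (result) : "0" (x));
    return result;
}

int main(void)
{
    /* Arguments in the range the library actually uses (<= 2π). */
    for (double x = 0.0; x < 6.3; x += 0.7)
        printf("fsin(%f) = %.17g\n", x, fsin_raw(x));
    return 0;
}
```

Timing that wrapper in the same harness, natively and under callgrind, would at least show whether Valgrind's emulation of `fsin` itself is unusually slow, independent of whatever glibc is actually doing.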
The only place I've found any discussion along the lines of what I'm seeing is an old thread on the GCC mailing list; it was interesting to look over, but I didn't notice anything there that clarified this (and I'd be wary of taking information from 2012 at face value anyway).