
I'm working on an audio library in which the sine of a number needs to be calculated within a very tight loop. Various levels of inaccuracy in the results might be tolerable for the user depending on their goals and environment, so I'm providing the ability to pick between a few sine approximations with differing accuracy and speed characteristics. One of these shows as ~31% faster than glibc's sin() when running under callgrind, but ~2% slower when running outside of it if the library is compiled with -O3 and ~25% slower if compiled with -Ofast. Should I trust callgrind or the "native" results, in terms of designing the library's interface?

My gut instinct is to distrust callgrind and go with the wall-clock results, because that's what really matters in the end anyway. However, I'm worried that what I'm seeing is caused by something particular about my processor (i7-7700k), compiler (gcc 10.2.0) or other aspects of my environment (Arch Linux, kernel v5.9.13) that might not carry over for other users. Is there any chance that callgrind is showing me something "generally true", even if it's not quite true for me specifically?

The relative performance differences of the in-library sine implementations stay the same inside and outside of callgrind; only the apparent performance of glibc's sin() differs. These patterns hold with varying amounts of work and across repeated runs. Interestingly, with -O1 the relative performance differences are comparable inside and outside of callgrind, but not with -O0, -O2, -O3, or -Ofast.
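
For reference, here is a minimal sketch of the kind of wall-clock harness I'm describing (the input distribution and the function under test are placeholders, not my actual code):

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    constexpr int N = 10'000'000;
    volatile double sink = 0.0;  // defeat dead-code elimination

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        double x = (i % 6283) * 1e-3;  // stays below 2π, as in my real inputs
        sink = sink + std::sin(x);     // swap in the approximation under test
    }
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double> dt = t1 - t0;
    std::printf("%.3f s total, %.1f ns/call\n", dt.count(), dt.count() / N * 1e9);
}
```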

The input to glibc's sin() is in many ways a good case for it: it's a double that is always <= 2π, and is never subnormal, NaN, or infinite. This makes me wonder if the glibc sin() might be calling my CPU's fsin instruction some of the time, as Intel's documentation says it's reasonably accurate for arguments < ~3π/4 (see Intel 64 and IA-32 Architectures Software Developer's Manual: Vol. 1, pg. 8-22). If that is the case, it seems possible that the Valgrind VM would have significantly different performance characteristics for that instruction, since in theory less attention might be paid to it during development than to more frequently-used instructions. However, I've read the C source for the current Linux x86-64 implementation of sin() in glibc and I don't remember anything like that, nor do I see it in the callgrind disassembly (it seems to be doing its work "manually" using general-purpose AVX instructions). I've heard that glibc used to use fsin years ago, but my understanding is that they stopped because of its accuracy issues.

The only place I've found discussion of anything along the lines of what I'm seeing is an old thread on the GCC mailing list, but although it was interesting to look over I didn't notice anything there that clarified this (and I'd be wary about taking information from 2012 at face value anyway).

  • How do you profile your code? How did you calculate those `~31%` and `~2%` numbers? Why not try other methods of profiling? What does it mean to "be outside of callgrind"? – KamilCuk Dec 15 '20 at 15:44
  • Both with GNU `time` and more precisely with C++'s `std::chrono::steady_clock` to time the relevant functions. They give comparable results. – Zoë Sparks Dec 15 '20 at 15:47
  • Not an answer since I have never used `callgrind`. The documentation *suggests* that it is an instrumenting, call-graph generating, high-overhead profiler. That means a high risk of distorting any wall clock timing of the application under test. Since end-to-end wall clock timing *at application level* is presumably all users care about, I would focus on that (use a high resolution system timer like Linux's `gettimeofday()`). `sin()` implementations typically have a "fast path" for arguments small in magnitude, which may explain your observations. – njuffa Dec 15 '20 at 22:24
  • @ZoëSparks My take is that instrumenting profilers tend to burden different parts of the code with *different amounts* of overhead, depending on how the code is structured into called subroutines. To confirm or refute this hypothesis as the explanation for the observations, detailed analysis of the generated code and the instrumentation points added by the profiler would appear to be necessary. I would recommend use of a relatively lightweight sampling profiler if more detailed performance data than simple app-level timing is needed. – njuffa Dec 16 '20 at 03:22
  • @njuffa That does seem logical, although in practice this is the first time I've seen something like this in callgrind. Generally the relative timings of different functions correlate strongly with other profiling methods even if the absolute timings differ. In this case callgrind represents the in-library function of interest as disproportionately fast, in addition to representing the glibc `sin()`-calling function as disproportionately slow. If you're right that that's just a quirk of the profiler and wouldn't show up on other hardware, I can relax. I'll look at what it's doing more deeply. – Zoë Sparks Dec 16 '20 at 06:09
  • 1
    @ZoëSparks Personally, I am a big fan of vigorously investigating any observation that does not fit one's mental model (i.e. "that doesn't make sense") until there is enough information to formulate a plausible explanation that jibes with the data/observations. My industry experience (I retired in 2014) indicates that not everybody agrees with that approach. – njuffa Dec 16 '20 at 06:22
  • @njuffa Yes, I very much agree. – Zoë Sparks Dec 16 '20 at 07:17

1 Answer


When you run a program under Callgrind or any other tool of the Valgrind family, the program is disassembled on the fly into Valgrind's intermediate representation, which is then instrumented and translated back to the native instruction set.

The profiling figures that Callgrind and Cachegrind give you are figures for the simplified processors they model. Since they don't have a detailed model of a modern CPU's pipeline, their results will not accurately reflect differences in actual performance (they can capture effects on the order of "this function executes 3x more instructions than that function", but not "this instruction sequence can be executed with higher instruction-level parallelism").
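
To illustrate the kind of effect an instruction-count model misses (a hypothetical example, not taken from the question's code): the two reductions below execute nearly the same number of instructions, so Callgrind would score them about equally, yet the second typically runs several times faster natively because its four independent accumulators let the additions overlap in the pipeline.

```cpp
#include <cstddef>

// One accumulator: every add waits on the previous one (a serial
// dependency chain), so the loop runs at floating-point add *latency*.
double sum_serial(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Four accumulators: nearly the same instruction count, but four
// independent chains let the CPU overlap the adds, approaching add
// *throughput* instead of latency.
double sum_parallel(const double* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)  // remainder
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```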

One of the most important factors when computing sin-like functions in a loop is allowing the computation to be vectorized: on x86, SSE2 gives a 2x vectorization factor for double and 4x for float. The compiler can achieve that most easily if you provide inlinable, branchless approximations, although it is also possible for calls to sin() itself with a new enough glibc and GCC (via the libmvec vector math library, but you need to pass a large subset of the -ffast-math flags to GCC to enable it).
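
For concreteness, here is a rough sketch of such a function (a plain Taylor polynomial, chosen purely for illustration; a real implementation would use a minimax polynomial and some form of range reduction):

```cpp
#include <cstddef>

// Branchless 7th-order Taylor approximation of sin(x); usable only for
// |x| up to roughly pi/2, where the truncation error stays small. No
// range reduction and no special-case handling: that is what keeps it
// inlinable and branch-free.
static inline double sin_approx(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
               + x2 * (1.0 / 120.0
               + x2 * (-1.0 / 5040.0))));
}

// A loop like this is straightforward for the compiler to auto-vectorize
// (at -O3, or -O2 with -ftree-vectorize), since each iteration is pure
// straight-line arithmetic with no branches or calls.
void sin_buffer(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = sin_approx(in[i]);
}
```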

If you haven't seen it already: Arm's optimized-routines repository contains modern, vectorizable implementations of several functions, including sin/cos in both single and double precision.

P.S. sin should never return a zero result for a tiny but non-zero argument. When x is close to zero, sin(x) and x differ by less than |x*x*x|, so as x approaches zero, x itself becomes the closest representable number to sin(x).
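
In code, that observation translates into a trivial near-zero fast path. A sketch (the 2^-26 cutoff is a commonly used threshold for double precision, not something specific to glibc or the question):

```cpp
#include <cmath>

// Hypothetical near-zero fast path for a double-precision sine.
// For |x| < 2^-26, |sin(x) - x| < |x|^3 / 6 is less than half an ulp
// of x, so returning x itself is the correctly rounded result.
double sin_with_tiny_path(double x) {
    if (std::fabs(x) < 0x1p-26)  // 0x1p-26 == 2^-26 (hex float literal)
        return x;
    return std::sin(x);          // otherwise, the full implementation
}
```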

amonakov
  • Yes, this matches what I found when I looked more closely. The multiplications in the glibc `sin()` are vectorized efficiently, which makes its more expensive branches still not so costly on my CPU, but also its near-zero path is very cheap. It turned out that near-zero input was actually happening a lot more often than I realized; once I built in a near-zero path, my algorithms got around twice as fast for typical input sets, handily beating glibc `sin()` (though less accurate). Also, yeah, I was wrong, it does return `x` as opposed to zero—I deleted my comment so as not to mislead. – Zoë Sparks Dec 18 '20 at 15:14
  • Oh by the way, thanks for that link to that Arm repo! It's a great reference. – Zoë Sparks Jan 04 '21 at 04:03