I am trying to optimize a sin/cos approximation function. At its core there is a simple Horner scheme consisting of a bunch of multiplies and adds. Compiler is MSVC from VS2017, processor is Intel Xeon E5-1650, hyperthreading is on (but observations are basically identical if it is off).
Using Intel's VTune Amplifier 2019, I have obtained profiling results (release build, of course) for > 1 min of running the function on random doubles (between -2 pi and 2 pi), with ~40% of clockticks spent in the section shown below (the rest is range reduction + test harness). However, I cannot make sense of the microarchitectural metrics that VTune presents me:
(MSVC's source line attribution after inlining is awful.)
Here is the corresponding C++ code that got inlined:
void stableSinCosApproximation(double x, double* sinApprox, double* cosApprox)
{
    double x2 = x * x;
    *sinApprox = x * (sinCoeff[7] + x2 * (sinCoeff[6] + x2 * (sinCoeff[5] + x2 * (sinCoeff[4] + x2 * (sinCoeff[3] + x2 * (sinCoeff[2] + x2 * (sinCoeff[1] + x2 * sinCoeff[0])))))));
    *cosApprox = (cosCoeff[7] + x2 * (cosCoeff[6] + x2 * (cosCoeff[5] + x2 * (cosCoeff[4] + x2 * (cosCoeff[3] + x2 * (cosCoeff[2] + x2 * (cosCoeff[1] + x2 * cosCoeff[0])))))));
}
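For anyone who wants to reproduce the measurements, the following is a minimal self-contained sketch of the same Horner evaluation, written as a loop. The coefficient tables are not shown above, so plain Taylor coefficients are used as stand-ins; that is an assumption, as is the function name `sinCosSketch`. The real tables are presumably tuned differently. The loop performs the multiplies and adds in the same order as the nested expression, highest-order coefficient first.

```cpp
#include <cmath>

// ASSUMPTION: Taylor coefficients as stand-ins for the real tables,
// which are not part of the post. Index 0 holds the highest-order term,
// matching the indexing used in stableSinCosApproximation.
// sin(x) ~ x * sum_{k=0..7} (-1)^k * x^(2k) / (2k+1)!
static const double sinCoeff[8] = {
    -1.0 / 1307674368000.0,  // -1/15!
     1.0 / 6227020800.0,     // +1/13!
    -1.0 / 39916800.0,       // -1/11!
     1.0 / 362880.0,         // +1/9!
    -1.0 / 5040.0,           // -1/7!
     1.0 / 120.0,            // +1/5!
    -1.0 / 6.0,              // -1/3!
     1.0                     // +1/1!
};

// cos(x) ~ sum_{k=0..7} (-1)^k * x^(2k) / (2k)!
static const double cosCoeff[8] = {
    -1.0 / 87178291200.0,    // -1/14!
     1.0 / 479001600.0,      // +1/12!
    -1.0 / 3628800.0,        // -1/10!
     1.0 / 40320.0,          // +1/8!
    -1.0 / 720.0,            // -1/6!
     1.0 / 24.0,             // +1/4!
    -1.0 / 2.0,              // -1/2!
     1.0                     // +1/0!
};

// Hypothetical name; same Horner recurrence as the nested expression,
// evaluated from the innermost term outwards.
void sinCosSketch(double x, double* sinApprox, double* cosApprox)
{
    const double x2 = x * x;
    double s = sinCoeff[0];
    double c = cosCoeff[0];
    for (int i = 1; i < 8; ++i) {
        // Each step depends on the previous one: two serial
        // multiply-add chains of length 7 that can overlap.
        s = sinCoeff[i] + x2 * s;
        c = cosCoeff[i] + x2 * c;
    }
    *sinApprox = x * s;
    *cosApprox = c;
}
```

Called as `sinCosSketch(0.5, &s, &c)`, this should agree closely with `std::sin` and `std::cos` for small arguments. With stand-in Taylor coefficients the error will grow for larger inputs, so it is only meant for reproducing the instruction mix, not the accuracy of the real function.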
Clearly, the assembly listing has only one continuous block of instructions. No jumps (and no jump targets), no branching or conditional execution at all. Yet, there are multiple metrics here whose values I cannot make sense of with the information provided by VTune's inbuilt or online help.
Concrete questions:
The second half of the code has almost no attribution, clockticks and all. Why?
The first half has an ever-increasing CPI rate. OK, maybe this and the previous point are both due to the attribution going wrong somehow, but I don't see how.
The metrics say that there is bad speculation. But upon expanding that column, it shows neither branch mispredicts nor machine clears:
What is this supposed to tell me? In what capacity does the CPU speculate here?
I also allegedly lose a good chunk of uops to being front-end bound. Is the correlation to the bad speculation column only coincidence? What should I do with this information?
Preemptive notes:
The point of reimplementing this is guaranteed consistency across multiple platforms (from the same binary). The inbuilt sin/cos functions can vary by a few ULP across machines, which can kill reproducibility of results.
Yes, I know about FMAs, but not every platform that this (single) binary has to run on provides them. I'm not going for run-time dispatch at the moment.