Using AVX intrinsics and the Kahan summation algorithm, I've tried this (just a part of the "adder"):
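void add(const __m256 valuesToAdd)
{
    // y = input minus the error carried over from the previous call
    volatile __m256 y = _mm256_sub_ps(valuesToAdd, accumulatedError);
    // t = running sum plus corrected input (low-order bits may be lost here)
    volatile __m256 t = _mm256_add_ps(accumulator, y);
    // (t - accumulator) - y recovers the bits lost in the addition above
    accumulatedError = _mm256_sub_ps(_mm256_sub_ps(t, accumulator), y);
    accumulator = t;
}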
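For reference, each of the eight lanes is meant to perform the usual scalar Kahan step, and the order of the three operations is what matters. Below is a minimal scalar sketch of that intent. The names KahanScalar, sum, compensation, kahanRepeat and naiveRepeat are made up for illustration and are not part of my real code. It assumes the compiler is not allowed to reassociate floating-point math (no -ffast-math or similar).

```cpp
#include <cstddef>

// Hypothetical scalar counterpart of the vector adder; names are illustrative only.
struct KahanScalar {
    float sum = 0.0f;           // plays the role of `accumulator`
    float compensation = 0.0f;  // plays the role of `accumulatedError`

    void add(float value) {
        float y = value - compensation;  // apply the error carried from the previous step
        float t = sum + y;               // low-order bits of y may be lost in this addition
        compensation = (t - sum) - y;    // recover the lost bits; zero only in exact arithmetic
        sum = t;
    }
};

// Adds `value` to `start` `count` times using the compensated step above.
float kahanRepeat(float start, float value, std::size_t count) {
    KahanScalar k;
    k.add(start);
    for (std::size_t i = 0; i < count; ++i) {
        k.add(value);
    }
    return k.sum;
}

// Same loop with a plain running sum, for comparison.
float naiveRepeat(float start, float value, std::size_t count) {
    float s = start;
    for (std::size_t i = 0; i < count; ++i) {
        s += value;
    }
    return s;
}
```

With IEEE single precision, naiveRepeat(1.0f, 1e-8f, n) never moves away from 1.0f because each addend is below half an ulp of the sum, while kahanRepeat(1.0f, 1e-8f, 10000000) ends up close to 1.1. That is the behaviour I want to keep per lane in the AVX version.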
There is no error, but when I check the disassembly (perf record and perf report on Ubuntu), it shows that all elements of the accumulator, y and accumulatedError variables are computed one by one, in a scalar way.
Q: How can one define an intrinsic variable that keeps its "order of operations" and can still be used (vectorized) in an intrinsic instruction without being optimized away?
To make sure it really is scalar, I removed volatile, and it became faster.
Is there a way to tell gcc that I need a variable/code vectorized, with nothing else touched?