
Using AVX intrinsics and the Kahan summation algorithm, I've tried this (just the core of the "adder"):

void add(const __m256 valuesToAdd)
{
    volatile __m256 y = _mm256_sub_ps(valuesToAdd, accumulatedError);
    volatile __m256 t = _mm256_add_ps(accumulator,y);
    accumulatedError = _mm256_sub_ps( _mm256_sub_ps(t,accumulator),y);
    accumulator = t;
}

There is no compile error, but when I check the disassembly (perf record/report on Ubuntu), it shows that all elements of the accumulator, y, and accumulatedError variables are computed one by one, in a scalar way.

Q: How can one define an intrinsic variable that keeps its "order of operations" and is still used (as vectorized) in an intrinsic instruction, without being optimized away?

To check whether it really was scalar, I removed volatile, and it became faster.

Is there a way to tell gcc that I need a variable/piece of code vectorized, with nothing else touched?

huseyin tugrul buyukisik
    What are you trying to accomplish with `volatile`? To keep a local variable from being optimized out? Why would you need that? `volatile` is of no benefit for a local, stack based variable. With that keyword there, the compiler is apparently treating the access to each float as a separate distinct access under the assumption that they could change between accesses when done linearly (non-serialized), so the AVX intrinsic is coded to emulate that behavior. – 1201ProgramAlarm Aug 06 '17 at 01:16
  • Then my only option is to use inline-asm to do non-changing ordered operations? – huseyin tugrul buyukisik Aug 06 '17 at 01:20
  • I don't know. I don't understand what it is you're trying to accomplish. – 1201ProgramAlarm Aug 06 '17 at 01:28
  • I'm using Kahan summation, and its order of subtractions and additions must not be altered. GCC can't alter an asm block, but that's harder to code than intrinsics. – huseyin tugrul buyukisik Aug 06 '17 at 01:33
  • Either you can do 8 subtractions at the same time (in which case the intrinsics work as well as the inline assembly), or the results change when you do the 8 subtractions at once because of a dependency between those 8 values (in which case the assembly code won't be able to change that). – 1201ProgramAlarm Aug 06 '17 at 01:36
  • I meant that the addition, then the subtraction, then the other subtraction can be completely optimized out, not just reordered. Not the 8 horizontal elements. – huseyin tugrul buyukisik Aug 06 '17 at 01:38
  • Ah, I see now that I've looked up the algorithm. You may be able to disable optimizations using a `#pragma`, or put this function in a separate source module that you compile without optimizations. Using the AVX instructions here will run 8 parallel Kahan summations. Since this will accumulate the error differently than using one summation for all the values, you may get a slightly different answer when you combine them. – 1201ProgramAlarm Aug 06 '17 at 01:53
  • Actually, the GPGPU side will have similar Kahan summations (but much wider), so it will not be a problem; shouldn't be, at least. – huseyin tugrul buyukisik Aug 06 '17 at 02:18
  • I can't reproduce your claim that g++6.3 un-vectorizes your intrinsics with `vaddss` instructions instead of `vaddps ymm`. Instead, `volatile` has the expected effect of forcing actual loads/stores to memory for every read/write of the variable: https://godbolt.org/g/BDr2tn. Clang folds the reloads into memory operands for `vaddps` / `vsubps`, but gcc uses separate `vmovaps` instructions. Of course it became faster when you removed `volatile`, but not because the volatile version was scalar! – Peter Cordes Aug 07 '17 at 05:19

1 Answer


If you only want to prevent associative-math optimizations, don't use volatile; disable them with a function attribute:

__attribute__ ((optimize("no-fast-math")))
inline void add(const __m256 &valuesToAdd)
{
  __m256 y = _mm256_sub_ps(valuesToAdd, accumulatedError);           // correct the input by the running error
  __m256 t = _mm256_add_ps(accumulator, y);                          // accumulate; low-order bits of y may be lost
  accumulatedError = _mm256_sub_ps(_mm256_sub_ps(t, accumulator), y); // recover the lost part
  accumulator = t;
}

Live Demo. Play around with the compile flags and attributes. This attribute does not seem to work with clang (I guess there is something equivalent, but your question was g++-specific).

chtz