I would like to know the implementation of global atomics in Kepler.
see this piece of code:
1. if (threadIdx.x < workers) {
2. temp = atomicAdd(dst, temp + rangeOffset);
3. if (isLastPartialCalc(temp)) {
4. atomicAdd(dst,-300000.0f);
5. }
6. }
if I change line 4 for this:
*dst -= 300000.0f;
The performance is lower! The change is safe, since no more threads will write on this value (the output are the same).
kernel using atomic: ~883us kernel using gmem directly: ~903us
I have run several times and I always get this ~20us penalty for the change
UPDATE It seems the store without using the atomic always produces a miss in L2 whilst the atomic version always produces a hit ... so I guess trying to write to some location which was flagged (or something) with "atomic" is not allowed in L2 and it makes another request to gmem