I need to add a float to the same global memory address from within multiple threads in OpenCL. No two simulation runs produce identical results, and the calls to the atomic_add_f function are the source of the discrepancy. I'm using an Nvidia Titan Xp GPU with driver 436.02.
Since OpenCL does not support atomic_add with float, there are workarounds using atomic_cmpxchg:
void atomic_add_f(volatile global float* addr, const float val) {
    union {
        uint u32;
        float f32;
    } next, expected, current;
    current.f32 = *addr;
    do {
        next.f32 = (expected.f32=current.f32)+val; // ...*val for atomic_mul_f()
        current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32);
    } while(current.u32!=expected.u32);
}
However, this code produces non-deterministic results: they vary slightly from run to run, much as they would if a race condition were present.
I also tried this version
void atomic_add_f(volatile global float* addr, const float val) {
    private float old, sum;
    do {
        old = *addr;
        sum = old+val;
    } while(atomic_cmpxchg((volatile global int*)addr, as_int(old), as_int(sum))!=as_int(old));
}
which does not work properly either. The version presented here does not work either.
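One effect I have not been able to rule out is that float addition is not associative. If so, the order in which threads win the compare-and-swap would change the rounded sum even when every individual add is atomic. Below is a minimal host-side sketch of that effect in plain C (not the kernel code). The function names sum_forward and sum_backward and the three test values are made up for illustration, and it assumes IEEE 754 single-precision arithmetic:

```c
#include <stdio.h>

/* Add the values front to back, rounding to float after every step. */
float sum_forward(const float *v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s += v[i];
    }
    return s;
}

/* Add the same values back to front. */
float sum_backward(const float *v, int n) {
    float s = 0.0f;
    for (int i = n - 1; i >= 0; --i) {
        s += v[i];
    }
    return s;
}

/* Hypothetical helper: print both sums so the difference is visible. */
void print_both_orders(const float *v, int n) {
    printf("forward:  %.1f\n", sum_forward(v, n));
    printf("backward: %.1f\n", sum_backward(v, n));
}
```

Called on the values {1.0e8f, -1.0e8f, 1.0f}, the forward sum cancels the two large terms first and keeps the 1, while the backward sum loses the 1 when it is added to -1.0e8f and ends at 0. In the kernel the order of additions is decided by thread scheduling, which I cannot control.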
How can this be, and how can I solve it?