You can use atomicCAS()
for this. It does an atomic Compare-And-Swap operation.
This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.
The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.
Here's a worked example:
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Responding to a question in the comments, we could extend this to a char
sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd()
function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned
) and we will convert the needed operation to effectively support a char
size. This is done by converting the char
address to a suitably-aligned unsigned
address, and then doing shifting of the char
quantity to line up in the appropriate byte position in the unsigned
value.
Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned mask = 0xFFU;
mask <<= al_offset;
mask = ~mask;
unsigned sval = val;
sval <<= al_offset;
unsigned old = *al_addr, assumed, setval;
do {
assumed = old;
setval = assumed & mask;
setval |= sval;
old = atomicCAS(al_addr, assumed, setval);
} while (assumed != old);
return (char) ((assumed >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
The above presents a generalized atomicCAS
for char
size. This would allow you to swap any char
value for any other char
value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr
as already mentioned in the comments. The use of the atomicOr
would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
These char
atomic methods assume that you have allocated a char
array whose size is a multiple of 4. It would not be valid to do this with a char
array of size 3 (and only 3 threads), for example.