Questions tagged [gpu-atomics]

Modern GPUs support atomic operations in several memory spaces. These differ from atomic operations on CPUs both in implementation and in their effect on execution flow.

On modern GPUs, atomic operations in global device memory may require synchronization among thousands of logical threads (or hundreds of warps/wavefronts). A GPU may also support atomic operations on an individual processing core's memory (shared memory in CUDA parlance, local memory in OpenCL parlance), which behave differently from global memory atomics in both performance and execution flow.
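As a rough illustration of the two memory spaces described above, here is a minimal CUDA sketch of a histogram computed two ways: once with every thread issuing atomics directly on global memory, and once with each block accumulating in shared memory before merging. The kernel names, `NUM_BINS`, the launch configuration, and the host helper `histograms_match_reference` are all hypothetical and chosen only for this example; it makes no claim about which variant is faster on any particular GPU.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 64;  // hypothetical bin count, for illustration only

// Global-memory atomics: every thread in every block updates the same
// NUM_BINS addresses in device memory.
__global__ void histogram_global(const unsigned char* data, int n, unsigned int* bins)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i] % NUM_BINS], 1u);
}

// Shared-memory atomics: each block accumulates a private histogram first,
// then merges it into global memory with one atomic per bin per block.
__global__ void histogram_shared(const unsigned char* data, int n, unsigned int* bins)
{
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();  // all bins zeroed before any thread counts

    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i] % NUM_BINS], 1u);
    __syncthreads();  // all counts in place before the merge

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}

// Hypothetical host helper: runs both kernels on generated input and
// compares each result with a CPU reference histogram.
bool histograms_match_reference(int n)
{
    std::vector<unsigned char> h_data(n);
    std::vector<unsigned int> ref(NUM_BINS, 0);
    for (int i = 0; i < n; ++i) {
        h_data[i] = static_cast<unsigned char>((i * 31 + 7) % 251);
        ref[h_data[i] % NUM_BINS]++;
    }

    unsigned char* d_data = nullptr;
    unsigned int* d_bins = nullptr;  // first half: global variant, second half: shared variant
    const size_t bins_bytes = 2 * NUM_BINS * sizeof(unsigned int);
    if (cudaMalloc(&d_data, n) != cudaSuccess) return false;
    if (cudaMalloc(&d_bins, bins_bytes) != cudaSuccess) { cudaFree(d_data); return false; }
    cudaMemcpy(d_data, h_data.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, bins_bytes);

    histogram_global<<<32, 256>>>(d_data, n, d_bins);
    histogram_shared<<<32, 256>>>(d_data, n, d_bins + NUM_BINS);
    cudaError_t launch_err = cudaGetLastError();

    // cudaMemcpy on the default stream waits for both kernels to finish.
    std::vector<unsigned int> out(2 * NUM_BINS);
    cudaError_t copy_err = cudaMemcpy(out.data(), d_bins, bins_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return false;

    for (int b = 0; b < NUM_BINS; ++b)
        if (out[b] != ref[b] || out[NUM_BINS + b] != ref[b]) return false;
    return true;
}
```

Calling `histograms_match_reference(1 << 20)` from a `main` function should return `true` on a working CUDA device. Both kernels compute the same counts; the shared-memory variant simply moves most of the contention from device-wide memory to per-block memory, which is the distinction the description above draws.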


34 questions
2
votes
1 answer

cuda atomic add visibility

In CUDA, is the result of an atomic operation immediately visible to the threads of other warps in the same block as the one performing the atomic operation? In the case of a non-atomic operation, I know that the result may not be visible until…
small_potato
  • 3,127
  • 5
  • 39
  • 45
2
votes
1 answer

Why does the OpenCL atomic_add implementation for float produce a non-deterministic outcome?

I need to add a float to the same global memory address from within multiple threads in OpenCL. For any two simulation runs, the outcome is never identical and the calls to the atomic_add_f function are the source of this error. I'm using a Nvidia…
ProjectPhysX
  • 4,535
  • 2
  • 14
  • 34
2
votes
1 answer

Are atomic operations in CUDA guaranteed to be scheduled per warp?

Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each block either writes all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array. Now to write the actual values…
TheDutchDevil
  • 826
  • 11
  • 24
1
vote
1 answer

Vulkan subgroupBarrier does not synchronize invocations

I have a somewhat complex procedure that contains a nested loop and a subgroupBarrier. In a simplified form it looks like while(true){ while(some_condition){ if(end_condition){ atomicAdd(some_variable,1); …
alagris
  • 1,838
  • 16
  • 31
1
vote
1 answer

Do 64-bit atomic operations work in OpenCL on AMD cards?

The implementation of emulated atomics in OpenCL following the STREAM blog works nicely for atomic add in 32-bit, on CPU as well as NVIDIA and AMD GPUs. The 64-bit equivalent based on the cl_khr_int64_base_atomics extension seems to run properly on…
AdrianO
  • 175
  • 1
  • 11
1
vote
1 answer

CUDA atomic lock: threads in sequence

I have code in which one section needs to be executed as a critical section. I am using a lock for that piece of code so that each thread of the kernel (set up with one thread per block) executes that piece of code atomically. The order of the threads is…
Stephen
  • 13
  • 1
  • 4
1
vote
1 answer

atomicInc() is not working

I have tried the program below using atomicInc(). __global__ void ker(int *count) { int n=1; int x = atomicInc ((unsigned int *)&count[0],n); CUPRINTF("In kernel count is %d\n",count[0]); } int main() { int hitCount[1]; int…
Alvin
  • 940
  • 2
  • 13
  • 27
1
vote
2 answers

How are global atomic operations implemented in Kepler? I got less performance using gmem rather than using atomics

I would like to know the implementation of global atomics in Kepler. see this piece of code: 1. if (threadIdx.x < workers) { 2. temp = atomicAdd(dst, temp + rangeOffset); 3. if (isLastPartialCalc(temp)) { 4. …
Dredok
  • 807
  • 1
  • 9
  • 30
0
votes
1 answer

CUDA: atomic operation on shared memory

My CUDA kernel generates something that is fed to the host at the end of block execution. The skeleton is as follows. host_data, to which the data is written, is allocated as host-mapped memory. host_data_count is also mapped memory which indicates the…
superscalar
  • 23
  • 1
  • 6
0
votes
1 answer

atomic operations in CUDA

The following program uses the implementation of atomic locks from 'Cuda By Example', but running the program freezes my machine. Can someone tell me what's wrong with my program? Thanks a lot Yifei #include __global__ void…
user11869
  • 1,083
  • 2
  • 14
  • 29
0
votes
2 answers

CUDA atomicAdd_block is undefined

According to the CUDA Programming Guide, "Atomic functions are only atomic with respect to other operations performed by threads of a particular set ... Block-wide atomics: atomic for all CUDA threads in the current program executing in the same thread…
user2348209
  • 136
  • 11
0
votes
1 answer

In CUDA programming, is an atomic function faster than reducing after calculating the intermediate results?

Atomic functions (such as atomic_add) are widely used for counting or performing summation/aggregation in CUDA programming. However, I cannot find information about the speed of atomic functions compared with ordinary global memory…
zbh2047
  • 393
  • 1
  • 9
0
votes
1 answer

CUDA global atomic operations across concurrent kernel executions

My CUDA application performs an associative reduction over a volume. Essentially each thread computes values which are atomically added to overlapping locations of the same output buffer in global memory. Is it possible to concurrently launch this…
AnimatedRNG
  • 1,859
  • 3
  • 26
  • 39
0
votes
1 answer

CUDA critical sections, thread/warp execution model and NVCC compiler decisions

Recently I posted this question about a critical section. Here is a similar question. In those questions the given answer says that it is up to the compiler whether the code "works" or not, because the order of the various paths of execution is up to the…
0
votes
2 answers

More than one variable to synchronize in CUDA

My program has lots of 4-byte strings, like "aaaa" "bbbb" "cccc"... I need to collect the particular strings that pass a CRC check. Because there's very little chance that a string passes the CRC check, I don't want to use a very BIG…
aj3423
  • 2,003
  • 3
  • 32
  • 70