Questions tagged [gpu-atomics]

Modern GPUs support atomic operations in different memory spaces. These differ from atomic operations on CPUs both in implementation and in their effect on execution flow.

On modern GPUs, atomic operations in global device memory may require synchronization among thousands of logical threads (or hundreds of warps/wavefronts). A GPU may also support atomic operations on an individual processing core's memory (shared memory in CUDA parlance, local memory in OpenCL parlance); these behave differently from global-memory atomics in both performance and execution flow.
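To make the distinction between the two memory spaces concrete, here is a minimal CUDA sketch of a byte histogram written two ways: one kernel issues every atomic directly against global memory, the other accumulates into a per-block shared-memory array and merges it into global memory once per block. The kernel names, the `histograms_match` helper, the launch configuration (64 blocks of 256 threads) and the histogram workload itself are illustrative assumptions, not taken from any of the referenced guides. Both variants should produce the same counts; how much faster the shared-memory variant runs, if at all, depends on the hardware and on how contended the bins are.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBins = 256;  // one bin per possible byte value

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "CUDA error: %s\n",                 \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Variant 1 (hypothetical name): every update is an atomic on global memory,
// so threads from all blocks may contend for the same address.
__global__ void histogram_global(const unsigned char* data, size_t n,
                                 unsigned int* bins) {
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        atomicAdd(&bins[data[i]], 1u);
    }
}

// Variant 2 (hypothetical name): atomics go to a per-block shared-memory
// histogram; only the per-block totals are added to global memory.
__global__ void histogram_shared(const unsigned char* data, size_t n,
                                 unsigned int* bins) {
    __shared__ unsigned int local[kBins];
    for (int b = threadIdx.x; b < kBins; b += blockDim.x) local[b] = 0;
    __syncthreads();  // all bins are zeroed before any thread counts

    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        atomicAdd(&local[data[i]], 1u);  // contention limited to this block
    }
    __syncthreads();  // all counts are in before the merge

    for (int b = threadIdx.x; b < kBins; b += blockDim.x) {
        if (local[b] != 0) atomicAdd(&bins[b], local[b]);
    }
}

// Runs both kernels on pseudo-random bytes and returns true if each result
// equals a histogram computed sequentially on the host.
bool histograms_match(size_t n) {
    std::vector<unsigned char> host(n);
    std::vector<unsigned int> expected(kBins, 0);
    unsigned int state = 12345u;
    for (size_t i = 0; i < n; ++i) {
        state = state * 1664525u + 1013904223u;  // simple LCG, fixed seed
        host[i] = static_cast<unsigned char>(state >> 24);
        ++expected[host[i]];
    }

    unsigned char* d_data = nullptr;
    unsigned int* d_bins = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n));
    CUDA_CHECK(cudaMalloc(&d_bins, kBins * sizeof(unsigned int)));
    CUDA_CHECK(cudaMemcpy(d_data, host.data(), n, cudaMemcpyHostToDevice));

    bool ok = true;
    std::vector<unsigned int> got(kBins);
    for (int variant = 0; variant < 2; ++variant) {
        CUDA_CHECK(cudaMemset(d_bins, 0, kBins * sizeof(unsigned int)));
        if (variant == 0) {
            histogram_global<<<64, 256>>>(d_data, n, d_bins);
        } else {
            histogram_shared<<<64, 256>>>(d_data, n, d_bins);
        }
        CUDA_CHECK(cudaGetLastError());
        // cudaMemcpy waits for the preceding kernel on the default stream.
        CUDA_CHECK(cudaMemcpy(got.data(), d_bins, kBins * sizeof(unsigned int),
                              cudaMemcpyDeviceToHost));
        ok = ok && (got == expected);
    }
    CUDA_CHECK(cudaFree(d_data));
    CUDA_CHECK(cudaFree(d_bins));
    return ok;
}

int main() {
    std::printf("%s\n", histograms_match(1 << 20) ? "match" : "MISMATCH");
    return 0;
}
```

Integer counts are used here on purpose: integer addition is associative, so the result does not depend on the order in which the atomics happen to execute. The same structure with floating-point `atomicAdd` can give slightly different sums from run to run, because the order of additions is not fixed.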

Reading on GPU atomics:

The OpenCL reference guide's section on atomic operations.
Intel's guide to using atomic operations with OpenCL.
The section on atomic operations in NVIDIA's CUDA Programming Guide.

34 questions

1 answer

cuda atomic add visibility

In CUDA, is the result of an atomic operation immediately visible to the threads of other warps in the same block as the one performing the atomic operation? In the case of a non-atomic operation, I know that the result may not be visible until…

cuda synchronization gpu-atomics

asked Jul 26 '11 at 21:32

small_potato

1 answer

Why does the OpenCL atomic_add implementation for float produce a non-deterministic outcome?

I need to add a float to the same global memory address from within multiple threads in OpenCL. For any two simulation runs, the outcome is never identical and the calls to the atomic_add_f function are the source of this error. I'm using a Nvidia…

opencl non-deterministic gpu-atomics

asked Aug 30 '19 at 15:06

ProjectPhysX

1 answer

Are atomic operations in CUDA guaranteed to be scheduled per warp?

Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each block either writes all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array. Now to write the actual values…

cuda gpu-atomics

asked Apr 12 '17 at 20:05

TheDutchDevil

1 answer

Vulkan subgroupBarrier does not synchronize invocations

I have a somewhat complex procedure that contains nested loop and a subgroupBarrier. In a simplified form it looks like while(true){ while(some_condition){ if(end_condition){ atomicAdd(some_variable,1); …

gpgpu vulkan barrier gpu-atomics

asked Sep 11 '21 at 15:57

alagris

1 answer

Do 64-bit atomic operations work in OpenCL on AMD cards?

The implementation of emulated atomics in OpenCL following the STREAM blog works nicely for atomic add in 32-bit, on CPU as well as NVIDIA and AMD GPUs. The 64-bit equivalent based on the cl_khr_int64_base_atomics extension seems to run properly on…

opencl atomic amd-gpu gpu-atomics

asked Apr 22 '21 at 10:35

AdrianO

1 answer

Cuda atomic lock: threads in sequence

I have code in which one section needs to be executed as a critical section. I am using a lock for that piece of code so that each thread of the kernel (set up with one thread per block) executes that piece of code atomically. The order of the threads is…

cuda gpu-atomics

asked Sep 11 '14 at 15:12

Stephen

1 answer

atomicInc() is not working

I have tried below program using atomicInc(). __global__ void ker(int *count) { int n=1; int x = atomicInc ((unsigned int *)&count[0],n); CUPRINTF("In kernel count is %d\n",count[0]); } int main() { int hitCount[1]; int…

cuda gpu-atomics

asked Aug 02 '13 at 04:48

Alvin

2 answers

How are global atomic operations implemented in Kepler? I got less performance using gmem rather than using atomics

I would like to know the implementation of global atomics in Kepler. See this piece of code: if (threadIdx.x < workers) { temp = atomicAdd(dst, temp + rangeOffset); if (isLastPartialCalc(temp)) { …

cuda gpu-atomics

asked Jul 01 '13 at 12:40

Dredok

1 answer

CUDA: atomic operation on shared memory

My CUDA kernel generates something that is fed to the host at the end of block execution. The skeleton is as follows. host_data, where the data is written, is allocated as host-mapped memory. host_data_count is also mapped memory which indicates the…

cuda gpu-atomics gpu-shared-memory

asked Feb 03 '12 at 10:45

superscalar

1 answer

atomic operations in CUDA

The following program uses the implementation of atomic locks from 'CUDA By Example', but running the program freezes my machine. Can someone tell me what's wrong with my program? Thanks a lot Yifei #include __global__ void…

cuda gpu-atomics

asked Jan 18 '12 at 20:59

user11869

2 answers

CUDA atomicAdd_block is undefined

According to CUDA Programming Guide, "Atomic functions are only atomic with respect to other operations performed by threads of a particular set ... Block-wide atomics: atomic for all CUDA threads in the current program executing in the same thread…

cuda gpu-atomics compute-capability

asked Nov 02 '21 at 22:25

user2348209

1 answer

In CUDA programming, is atomic function faster than reducing after calculating the intermediate results?

Atomic functions (such as atomic_add) are widely used for counting or performing summation/aggregation in CUDA programming. However, I cannot find information about the speed of atomic functions compared with ordinary global memory…

optimization parallel-processing cuda gpu-atomics

asked Apr 02 '21 at 15:24

zbh2047

1 answer

CUDA global atomic operations across concurrent kernel executions

My CUDA application performs an associative reduction over a volume. Essentially each thread computes values which are atomically added to overlapping locations of the same output buffer in global memory. Is it possible to concurrently launch this…

cuda atomic cuda-streams gpu-atomics

asked Aug 10 '19 at 02:35

AnimatedRNG

1 answer

CUDA critical sections, thread/warp execution model and NVCC compiler decisions

Recently I posted this question about a critical section. Here is a similar question. In those questions the given answer says that it is up to the compiler whether the code "works" or not, because the order of the various paths of execution is up to the…

compilation cuda compiler-optimization critical-section gpu-atomics

asked Jan 05 '19 at 21:06

fabian_mc

2 answers

more than one variable to synchronize in CUDA

My program has lots of 4-byte strings, like "aaaa" "bbbb" "cccc"... I need to collect particular strings that pass a CRC check. Because there's very little chance that a string can pass the CRC check, I don't want to use a very BIG…

cuda thread-safety gpu-atomics

asked Nov 10 '15 at 09:57

aj3423
