
Will this lead to inconsistencies in shared memory?

My kernel code looks like this (pseudocode):

__shared__ uint histogram[32][64];

uint threadLane = threadIdx.x % 32;

for (/* each input element `data` assigned to this thread */) {
    histogram[threadLane][data]++;   // plain, non-atomic read-modify-write
}

Will this lead to collisions, given that, in a block with 64 threads, threads with IDs x and (x + 32) will very often write to the same position in the matrix?

This program calculates a histogram for a given matrix. I have an analogous CPU program that does the same. The counts in the histogram calculated by the GPU are consistently about 1/128 lower than the ones calculated by the CPU, and I can't figure out why.

ismarlowe
  • Can you please provide some more details, especially about what `data` is in relation to threadIdx and about the launch configuration? Something that compiles would be better. – Davide Spataro Jul 12 '17 at 17:21

1 Answer


Yes, it is dangerous: it leads to race conditions.

If you cannot guarantee that each thread within a block has exclusive write access to a given shared-memory location, then you have a race condition that you need to solve with atomics or synchronization.
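The simplest fix is to make the shared-memory increment atomic. Below is a minimal sketch, not your exact kernel: `input`, `n`, and `bins` are assumed parameter names, the values are assumed to lie in [0, 64), and `bins` is assumed to be zero-initialized by the caller.

__global__ void histogram64(const unsigned int *input, unsigned int n,
                            unsigned int *bins)
{
    __shared__ unsigned int histogram[32][64];
    unsigned int threadLane = threadIdx.x % 32;

    // Zero the shared histogram cooperatively before counting.
    for (unsigned int i = threadIdx.x; i < 32 * 64; i += blockDim.x)
        (&histogram[0][0])[i] = 0;
    __syncthreads();

    // Grid-stride loop over the input; atomicAdd serializes the
    // read-modify-write on each shared bin, so no updates are lost.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&histogram[threadLane][input[i]], 1u);
    __syncthreads();

    // Fold the 32 per-lane rows into the global result.
    for (unsigned int bin = threadIdx.x; bin < 64; bin += blockDim.x) {
        unsigned int sum = 0;
        for (unsigned int lane = 0; lane < 32; ++lane)
            sum += histogram[lane][bin];
        atomicAdd(&bins[bin], sum);
    }
}

The atomics cost some serialization when threads collide on the same bin, which is exactly the contention the per-lane rows are meant to reduce.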

Take a look at this paper for a correct and efficient way of using shared memory for histogram computation: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/histogram64/doc/histogram.pdf

Note that there are plenty of libraries that let you compute a histogram in a few lines; Thrust, for instance.
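For reference, here is a sketch of the sort-and-search approach from the Thrust examples; `dense_histogram`, `data`, and `num_bins` are illustrative names, and `histogram` must already be sized to `num_bins`.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

// Dense histogram of integer values in [0, num_bins), computed by
// sorting the data and binary-searching the bin boundaries.
void dense_histogram(thrust::device_vector<unsigned int> &data,
                     thrust::device_vector<unsigned int> &histogram,
                     unsigned int num_bins)
{
    thrust::sort(data.begin(), data.end());

    // histogram[i] = number of elements <= i (cumulative counts).
    thrust::counting_iterator<unsigned int> search_begin(0);
    thrust::upper_bound(data.begin(), data.end(),
                        search_begin, search_begin + num_bins,
                        histogram.begin());

    // Differencing the cumulative counts yields the per-bin counts.
    thrust::adjacent_difference(histogram.begin(), histogram.end(),
                                histogram.begin());
}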

Davide Spataro
  • Thank you immensely! This was starting to drive me crazy. I will have to redo the whole thing in a different way. – ismarlowe Jul 12 '17 at 17:21
  • Take a look at `cuda-memcheck --tool racecheck`: http://docs.nvidia.com/cuda/cuda-memcheck/index.html#using-racecheck and https://stackoverflow.com/questions/13861017/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize – Davide Spataro Jul 12 '17 at 17:24
  • Is there a way to force one warp to finish a certain task before another takes over? Or, alternatively, is it possible to instruct the machine to access shared memory atomically? – ismarlowe Jul 12 '17 at 17:51
  • You should try to come up with a solution that does not depend on a particular scheduling of the warps; that is not what GPUs are built for. Read the doc from the NVIDIA engineer that I linked earlier ;) – Davide Spataro Jul 12 '17 at 18:01