
Will this lead to inconsistencies in shared memory?

My kernel code looks like this (pseudocode):

__shared__ uint histogram[32][64];

uint threadLane = threadIdx.x % 32;

for (/* each input element `data` assigned to this thread */) {
    histogram[threadLane][data]++;   // plain, non-atomic read-modify-write
}

Will this lead to collisions, given that, in a block with 64 threads, threads with IDs x and (x + 32) will very often write to the same position in the matrix?

This program calculates a histogram for a given matrix. I have an analogous CPU program that does the same. The counts in the histogram calculated by the GPU are consistently about 1/128 lower than the ones calculated by the CPU, and I can't figure out why.

ismarlowe
  • Can you please provide some more details, especially about what `data` is in relation to threadIdx and about the launch configuration? Something that compiles would be better. – Davide Spataro Jul 12 '17 at 17:21

1 Answer


Yes, it is dangerous: it leads to race conditions.

If you cannot guarantee that each thread within a block has exclusive write access to a given shared-memory location, then you have a race condition that you need to solve with atomics or synchronization.
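The simplest fix is to make the shared-memory increment atomic. Below is a minimal sketch, not your exact kernel: `input`, `n`, and `bins` are assumed parameter names, the values are assumed to lie in [0, 64), and `bins` is assumed to be zero-initialized by the caller.

__global__ void histogram64(const unsigned int *input, unsigned int n,
                            unsigned int *bins)
{
    __shared__ unsigned int histogram[32][64];
    unsigned int threadLane = threadIdx.x % 32;

    // Zero the shared histogram cooperatively before counting.
    for (unsigned int i = threadIdx.x; i < 32 * 64; i += blockDim.x)
        (&histogram[0][0])[i] = 0;
    __syncthreads();

    // Grid-stride loop over the input; atomicAdd serializes the
    // read-modify-write on each shared bin, so no updates are lost.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&histogram[threadLane][input[i]], 1u);
    __syncthreads();

    // Fold the 32 per-lane rows into the global result.
    for (unsigned int bin = threadIdx.x; bin < 64; bin += blockDim.x) {
        unsigned int sum = 0;
        for (unsigned int lane = 0; lane < 32; ++lane)
            sum += histogram[lane][bin];
        atomicAdd(&bins[bin], sum);
    }
}

The atomics cost some serialization when threads collide on the same bin, which is exactly the contention the per-lane rows are meant to reduce.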

Take a look at this paper for a correct and efficient way of using shared memory for histogram computation: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/histogram64/doc/histogram.pdf

Note that there are plenty of libraries that let you compute a histogram in a few lines; Thrust, for instance.
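For reference, here is a sketch of the sort-and-search approach from the Thrust examples; `dense_histogram`, `data`, and `num_bins` are illustrative names, and `histogram` must already be sized to `num_bins`.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

// Dense histogram of integer values in [0, num_bins), computed by
// sorting the data and binary-searching the bin boundaries.
void dense_histogram(thrust::device_vector<unsigned int> &data,
                     thrust::device_vector<unsigned int> &histogram,
                     unsigned int num_bins)
{
    thrust::sort(data.begin(), data.end());

    // histogram[i] = number of elements <= i (cumulative counts).
    thrust::counting_iterator<unsigned int> search_begin(0);
    thrust::upper_bound(data.begin(), data.end(),
                        search_begin, search_begin + num_bins,
                        histogram.begin());

    // Differencing the cumulative counts yields the per-bin counts.
    thrust::adjacent_difference(histogram.begin(), histogram.end(),
                                histogram.begin());
}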

Davide Spataro
  • Thank you immensely! This was starting to drive me crazy. I will have to redo the whole thing in a different way. – ismarlowe Jul 12 '17 at 17:21
  • Take a look at `cuda-memcheck --tool racecheck`: http://docs.nvidia.com/cuda/cuda-memcheck/index.html#using-racecheck and https://stackoverflow.com/questions/13861017/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize – Davide Spataro Jul 12 '17 at 17:24
  • Is there a way to force one warp to finish a certain task before another takes over? Or, alternatively, is it possible to instruct the machine to access shared memory atomically? – ismarlowe Jul 12 '17 at 17:51
  • You should try to come up with a solution that does not depend on a particular scheduling of the warps; that is not what GPUs are built for. Read the doc from the NVIDIA engineer that I linked earlier ;) – Davide Spataro Jul 12 '17 at 18:01