
I have a CUDA program whose kernel basically does the following.

  • I provide a list of n points in Cartesian coordinates, e.g. (x_i, y_i), in a plane of dimension dim_x * dim_y, and invoke the kernel accordingly.
  • For every point (x_p, y_p) on this plane I calculate, by a formula, the time it would take each of those n points to reach it, given that the points move with certain velocities.
  • I order those times in increasing order t_0, t_1, ..., t_{n-1}, where the precision of each t_i is one decimal place, i.e. if t'_i = 2.3453 then I only use t_i = 2.3.
  • Assuming the times are generated from a normal distribution, I simulate the 3 quickest times to find the percentage of trials in which each of those 3 points reached first; suppose a random experiment gives prob_0 = 0.76, prob_1 = 0.20 and prob_2 = 0.04. Since the point with time t_0 reaches first most often among the three, I also return that point's original index (before the times were sorted), say idx_0 = 5 (an integer). A sketch of this simulation step follows this list.
  • Hence for every point on this plane I get a pair (prob,idx).
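For concreteness, here is a minimal sketch of what that simulation step could look like as a CUDA device function using cuRAND. The function name simulate_top3, the draw count K and the per-time standard deviations sigma are my assumptions, not the actual code:

#include <curand_kernel.h>

// Hedged sketch: estimate, for the three quickest times t[0..2] with
// per-time standard deviations sigma[0..2], how often each point arrives
// first over K Monte Carlo draws, and report the most frequent winner.
__device__ void simulate_top3(const float t[3], const float sigma[3],
                              int K, curandState *rng,
                              float *prob_out, int *winner_out)
{
    int wins[3] = {0, 0, 0};
    for (int k = 0; k < K; ++k) {
        int best = 0;
        float best_t = t[0] + sigma[0] * curand_normal(rng);
        for (int j = 1; j < 3; ++j) {
            float tj = t[j] + sigma[j] * curand_normal(rng);
            if (tj < best_t) { best_t = tj; best = j; }
        }
        ++wins[best];
    }
    int w = 0;                                  // most frequent winner
    if (wins[1] > wins[w]) w = 1;
    if (wins[2] > wins[w]) w = 2;
    *winner_out = w;                            // index among the sorted top 3
    *prob_out   = (float)wins[w] / (float)K;
}

Each thread would carry its own curandState, initialized once with curand_init.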

Suppose n/2 of those points are of one kind and the rest are of the other. A sample image generated looks as follows.

Unoptimized

Especially when the precision of the times was set to one decimal place, I noticed that the number of unique 3-tuples of times (t_0, t_1, t_2) was just 2.5% of the total number of data points, i.e. of the number of points on the plane. This meant that most of the time the kernel was uselessly re-simulating when it could just reuse the values from previous simulations. Hence I could use a dictionary whose key is a 3-tuple of times and whose value is the index and prob. Since, as far as I know and have tested, the STL can't be accessed inside a kernel, I instead constructed an array of floats of size 201,000,000. This choice came from experimentation: none of the top 3 times exceeded 20 seconds, so each t_i can take any value from {0.0, 0.1, 0.2, ..., 20.0}, giving 201 choices, and 10*t_i is an integer in {0, 1, ..., 200}, so the key below never exceeds 200,200,200. I can then construct a key for such a dictionary like the following

  • Key = (10*t_0) * 10^6 + (10*t_1) * 10^3 + (10*t_2)

As far as the value is concerned, I could make it (idx + prob). Since idx is an integer and 0.0 <= prob < 1.0 (prob must stay strictly below 1.0 for the decoding to be unambiguous), I could retrieve both of those values later by the two steps below; a small sketch of this packing follows them.

  • prob=dict[key]-floor(dict[key])
  • idx = floor(dict[key])
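A minimal sketch of this packing scheme, written as plain host-side C++ for illustration (the names pack, unpack_idx and unpack_prob are mine, not part of the actual program):

#include <cmath>
#include <cstdio>

// Hedged sketch of the (idx + prob) packing described above. Note that a
// float has only ~24 bits of mantissa, so prob loses precision once idx
// grows large, and prob must stay strictly below 1.0.
float pack(int idx, float prob)  { return (float)idx + prob; }
int   unpack_idx(float v)        { return (int)std::floor(v); }
float unpack_prob(float v)       { return v - std::floor(v); }

int main() {
    float v = pack(5, 0.76f);
    std::printf("idx=%d prob=%f\n", unpack_idx(v), unpack_prob(v)); // idx=5 prob~0.76
    return 0;
}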

So now my kernel looks like the following

__global__ void my_kernel(float* points, float* dict, float* p, float* i, size_t w, ...){
  unsigned int col = blockIdx.y*blockDim.y + threadIdx.y;
  unsigned int row = blockIdx.x*blockDim.x + threadIdx.x;
  float prob, idx;
  size_t key;
  //Calculate the time taken for each of the n points to reach this point on the plane
  //Order the times in increasing order t_0, t_1, ..., t_{n-1}
  //Calculate key = (10*t_0) * 10^6 + (10*t_1) * 10^3 + (10*t_2)
  if(dict[key] > 0.0f){
    //Cache hit: decode the value stored by an earlier simulation
    prob = dict[key] - floorf(dict[key]);
    idx  = floorf(dict[key]);
  }
  else{
    //Cache miss: simulate and find prob and idx
    dict[key] = prob + idx;
  }
  p[row*w + col] = prob;
  i[row*w + col] = idx;
}

The result is quite similar to the original program's for most points, but for some it is wrong.

Optimized but not correct

I am quite sure that this is due to a race condition. Notice that dict was initialized with all zeroes. The basic idea would be to make the data structure "read many, write once" at any particular location of dict.

I am aware that there might be much more optimized ways of solving this problem than allocating so much memory; please let me know if that is the case. But I would really like to understand why this particular solution is failing. In particular, I would like to know how to use atomicAdd in this setting; my attempts to use it have failed. One possible write-once pattern is sketched below.
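For reference, here is one possible "read many, write once" pattern, sketched under my own assumptions: it uses atomicCAS rather than atomicAdd, together with a separate zero-initialized int array state that guards each dict slot. This is only an illustration of the idea, not the actual program:

// Hedged sketch: state[key] is 0 = empty, 1 = write in progress, 2 = ready.
// Threads never spin-wait (spinning can deadlock divergent warps); a thread
// that finds the slot not ready simply simulates, possibly redundantly.
__device__ void read_or_simulate(float* dict, int* state, size_t key,
                                 float* prob, float* idx)
{
    if (state[key] == 2) {                    // value already published
        float v = dict[key];
        *idx  = floorf(v);
        *prob = v - *idx;
        return;
    }
    // ... run the Monte Carlo simulation here, filling *prob and *idx ...
    if (atomicCAS(&state[key], 0, 1) == 0) {  // we won the right to write
        dict[key] = *idx + *prob;
        __threadfence();                      // publish the value first...
        atomicExch(&state[key], 2);           // ...then mark the slot ready
    }
}

On older architectures the plain read of state[key] may additionally need a volatile qualifier or an atomic read to avoid seeing a stale cached value.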

  • Since `dict` is in global memory, I don't think your `syncthreads` is doing anything useful. It is only a block-level synchronisation. It is also risky as there is the possibility that some members of a block will call it and others won't. – Tom Fenech Nov 27 '13 at 10:46
  • I have tried it with the syncthreads removed. The output is the same. – Sandipan Bhattacharyya Nov 27 '13 at 10:49
  • From what I understand from a perhaps superficial reading of your post, you do not necessarily have a one-to-one mapping between `key` and the 2D thread index. If this is true, I agree that you are most probably having a race condition since all the threads will read `dict[key]`, but some of them (presumably those for which `dict[key]==0` since I do not expect that `dict` will have negative values) are changing them. – Vitality Nov 27 '13 at 11:00
  • @SandipanBhattacharyya In that case, all the threads in the block are taking the same branch. If some take the `if` and others the `else`, you will have a deadlock. In any case, there is no advantage to some threads running the simulation and some threads reading `dict`, as all the threads in the block have to wait for each other anyway. – Tom Fenech Nov 27 '13 at 11:05
  • Could you please provide the smallest version of your code reproducing your problem? I'm afraid that the question, as it is, will remain unanswerable. The problem behind your second wrong image might also be in the unshown part. – Vitality Nov 27 '13 at 14:01
  • I have downvoted this question. The post is unclear and the poster is adding his own reasoning as comments, hiding some relevant information. I'm not even sure the answer below really answers the question. – Vitality Nov 27 '13 at 16:12

1 Answer


Unless your simulation in the else branch is very long (~100s of floating-point operations), a lookup table in global memory is likely to be slower than running the computation. Global memory access is very expensive!

In any case, there is no way to save time by "skipping work" using conditional branching. The Single Instruction, Multiple Thread architecture of a GPU means that the instructions for both sides of the branch will be executed serially, unless all of the threads in a warp follow the same branch.

edit:

The fact that you are seeing a performance increase as a result of introducing the conditional branch, and that you didn't have any problems with deadlock, suggests that all the threads in each block are always taking the same branch. I suspect that once dict starts getting populated, the performance increase will go away.

Perhaps I have misunderstood something, but if you want the probability density at a value x, assuming a normal distribution with mean mu and standard deviation sigma, there is no need to generate a load of random numbers and approximate a Gaussian curve. You can calculate it directly:

p = expf(-((x - mu) * (x - mu) / (2.0f * sigma * sigma))) /
    (sigma * sqrtf(2.0f * (float)M_PI));
Tom Fenech
  • Thanks for the reply! Though I have to add a few points here. The _else_ branch is actually very long: I generate 500 random numbers from the normal distribution with mean = t_i and standard deviation = function(t_i), for i in {0,1,2}, and do calculations thereon to compute the probabilities. Also, the time taken for a 10560 x 6800 point plane in the first case was 2.48 seconds; when I merely allocate that much global memory, the time taken to generate this single image jumps to 3.19 seconds. – Sandipan Bhattacharyya Nov 27 '13 at 11:58
  • But as soon as I introduce the if-else branch, the time taken reduces to 2.39 seconds. The idea from here is to generate not one frame of output, but to give as input points across multiple frames and get back probabilities across multiple frames, while saving on the time taken for simulation. I expect this to help because the number of unique 3-tuples, already quite small compared to the total number of points in a single frame, would be proportionally even smaller when taking 2-3 frames together. – Sandipan Bhattacharyya Nov 27 '13 at 12:00
  • I have hence removed syncthreads from the question! Thanks for the explanation. – Sandipan Bhattacharyya Nov 27 '13 at 12:14
  • Thanks! Now I know it's a non-CUDA problem. Also, I might not have been able to phrase the question properly. The idea is that we need to find prob(t_i < min{t_j} for all j != i). Hence, suppose in simulation #1 I draw three random numbers from N(t_0,f(t_0)), N(t_1,f(t_1)) and N(t_2,f(t_2)) respectively. I repeat this for K simulations and find the index of the point that reaches first, and the probability of it reaching before all others on average. There are formulae for this from the order statistics of IID random variables, but this is a non-IID case and I am not aware of any simple solution. – Sandipan Bhattacharyya Nov 27 '13 at 14:12
  • "unless it is known in advance which branch all the threads will take (branch predication), the instructions for both sides of the branch will be executed serially." Predication is not related to branch prediction. As far as I know, there is no branch prediction implemented on current NVIDIA GPUs. Predication is simply an efficient way to handle branches that only skip one or a few instructions. – Roger Dahl Nov 27 '13 at 15:46
  • @RogerDahl thanks for the correction, I have edited accordingly. – Tom Fenech Nov 27 '13 at 16:04