
I have a Monte Carlo step in CUDA that I need help with. I already wrote the serial code, and it works as expected. Let's say I have 256 particles, which are stored in

vector< vector<double> > *r;

Each element i of r has an (x, y) component, both of which are doubles; r[i] is the position of particle i.

Now, in CUDA, I'm supposed to allocate this vector on the host and send it to the device. Once on the device, these particles need to interact with each other. Each thread is supposed to run a Monte Carlo sweep. How do I allocate memory, reference/dereference pointers using cudaMalloc, and decide which functions to make global/shared? I just can't wrap my head around it.

Here's what my memory allocation looks like at the moment:

cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));    
CUDAErrorCheck();
kernel <<<blocks, threads>>> (&r, randomnums);
cudaDeviceSynchronize();
CUDAErrorCheck();
cudaMemcpy(r, blocks*threads*sizeof(double), cudaMemcpyDeviceToHost);

The above code is at potato level. I guess I'm not sure what to do, even conceptually. My main problem is allocating memory and passing information to and from device and host. The vector r needs to be allocated, copied from host to device, worked on in the device, and copied back to the host. Any help/"pointers" will be much appreciated.

jimmu
  • It is very difficult to tell you what to do when you are not telling us what you aim to do. As stated, I do not think your post is answerable. Make an effort to translate your sequential code to CUDA and post questions about your attempts. – Vitality Apr 12 '14 at 22:51
  • My main problem is allocating memory and passing information to and from device and host. The vector r needs to be allocated, copied from host to device, worked on in the device, and copied back to the host. I know it is just a few lines, but I failed dozens of times trying to do it and I'm just lost. – jimmu Apr 12 '14 at 23:03

1 Answer


Your "potato level" code demonstrates a general lack of understanding of CUDA, including but not limited to the management of the r data. I would suggest that you increase your knowledge of CUDA by taking advantage of some of the educational resources available, and then develop an understanding of at least one basic CUDA code, such as the vector add sample. You will then be much better able to frame questions and understand the responses you receive. An example:

This would almost never make sense:

    cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));    
    CUDAErrorCheck();
    kernel <<<blocks, threads>>> (&r, randomnums);

You either don't know the very basic concept that data must be transferred to the device (via cudaMemcpy) before it can be used by a GPU kernel, or you can't be bothered to write "potato level" code that makes any sense at all, which would suggest to me a lack of effort in writing a sensible question. Also, regardless of what r is, passing &r to a CUDA kernel would almost never make sense.

Regarding your question about how to move r back and forth:

  1. The first step in solving your problem will be to recast the r position data as something that is easily usable by a GPU kernel. In general, vector is not that useful for ordinary CUDA code, vector< vector<double> > even less so, and a pointer to one (*r) is worse still. Therefore, flatten (copy) your position data into one or two dynamically allocated 1-D arrays of double:

    #define N 1000
    ...
    vector< vector<double> > r(N);
    ...
    double *pos_x_h, *pos_y_h, *pos_x_d, *pos_y_d;
    pos_x_h = (double *)malloc(N*sizeof(double));
    pos_y_h = (double *)malloc(N*sizeof(double));
    // flatten: copy each (x,y) pair into separate 1-D host arrays
    for (int i = 0; i < N; i++){
      vector<double> temp = r[i];
      pos_x_h[i] = temp[0];
      pos_y_h[i] = temp[1];
    }
    
  2. Now you can allocate space for the data on the device and copy the data to the device:

    cudaMalloc(&pos_x_d, N*sizeof(double));
    cudaMalloc(&pos_y_d, N*sizeof(double));
    cudaMemcpy(pos_x_d, pos_x_h, N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(pos_y_d, pos_y_h, N*sizeof(double), cudaMemcpyHostToDevice);
    
  3. Now you can properly pass the position data to your kernel:

    kernel<<<blocks, threads>>>(pos_x_d, pos_y_d, ...);
    
  4. Copying the data back after the kernel will be approximately the reverse of the above steps. This will get you started:

    cudaMemcpy(pos_x_h, pos_x_d, N*sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(pos_y_h, pos_y_d, N*sizeof(double), cudaMemcpyDeviceToHost);
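    // a possible continuation (assumes each r[i] still holds its two
    // components): unpack the flattened results back into r
    for (int i = 0; i < N; i++){
      r[i][0] = pos_x_h[i];
      r[i][1] = pos_y_h[i];
    }
    // release device and host allocations when you are done
    cudaFree(pos_x_d);
    cudaFree(pos_y_d);
    free(pos_x_h);
    free(pos_y_h);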
    

There are many ways to skin the cat, of course; the above is just an example. However, the above data organization will be well suited to a kernel/thread strategy that assigns one thread to process one (x,y) position pair, as in the sketch below.
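For illustration only, a kernel skeleton matching that one-thread-per-pair strategy might look like the following; the displacement arrays dx/dy, the count n, and the trial move itself are placeholders, not your actual Monte Carlo step:

    // hypothetical skeleton: each thread handles one (x,y) pair
    __global__ void kernel(double *pos_x, double *pos_y,
                           const double *dx, const double *dy, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {               // guard: the grid may launch extra threads
            pos_x[i] += dx[i];     // placeholder trial move in x
            pos_y[i] += dy[i];     // placeholder trial move in y
        }
    }

It would be launched as in step 3, with blocks*threads chosen to cover all N particles.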

Robert Crovella
  • Thanks for the detailed answer. As you mentioned, I'm a n00b in CUDA. It looks like I'm going to have to flatten my array and go from there. The above code to flatten the array is giving me an error, probably because the position vector is of type vector< vector<double> >. – jimmu Apr 13 '14 at 02:18
  • I built a short test case to make sure that I had unpacked `vector< vector<double> >` correctly before I posted my answer. It is [here](http://pastebin.com/8mYgUHmR); I didn't see any issue with it. – Robert Crovella Apr 13 '14 at 02:23
  • I can't seem to find the source of the error. If you could check [here](http://pastebin.com/M2Sugxst) and see what's wrong, that'd be great. – jimmu Apr 13 '14 at 03:26
  • 1
    This line of code is not correct: `65. pos_x_h[i] = temp[i][0];` Take another look at what I posted. – Robert Crovella Apr 13 '14 at 05:41