I need help with a Monte Carlo step in CUDA. I have already written the serial code, and it works as expected. Say I have 256 particles, which are stored in
vector< vector<double> > *r;
Each element of r holds the (x, y) components of one particle's position, both of type double.
Now, in CUDA, I'm supposed to fill this vector on the host and send it to the device. Once on the device, the particles need to interact with each other, and each thread is supposed to run a Monte Carlo sweep. How do I allocate memory, how do I reference/dereference pointers with cudaMalloc, and which functions should I make global/shared? I just can't wrap my head around it.
Here's what my memory allocation looks like at the moment:
cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));
CUDAErrorCheck();
kernel <<<blocks, threads>>> (&r, randomnums);
cudaDeviceSynchronize();
CUDAErrorCheck();
cudaMemcpy(r, blocks*threads*sizeof(double), cudaMemcpyDeviceToHost);
The above code is at potato level. I'm not sure what to do, even conceptually. My main problem is allocating memory and passing data between the host and the device. The vector r needs to be allocated, copied from host to device, operated on in the device, and copied back to the host. Any help/"pointers" will be much appreciated.
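For reference, the round trip described above (allocate, copy host to device, run a kernel, copy back) could look roughly like the sketch below. It is only an illustration under several assumptions: the nested vector is flattened into a contiguous array of double2 because cudaMemcpy needs one contiguous buffer, the names CUDA_CHECK, shift_particles and shift_on_device are hypothetical, and the kernel just shifts every particle by a fixed amount as a stand-in for a real Monte Carlo move with random numbers and an accept/reject test.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Placeholder kernel (assumption): one thread per particle, each particle
// is shifted by (dx, dy). A real sweep would propose a random move and
// accept or reject it based on the energy change.
__global__ void shift_particles(double2* r, int n, double dx, double dy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard: the grid may be larger than n
        r[i].x += dx;
        r[i].y += dy;
    }
}

// Hypothetical host wrapper: flatten r, copy it to the device, launch the
// kernel, copy the result back, and write it into r again.
void shift_on_device(std::vector<std::vector<double>>& r, double dx, double dy)
{
    const int n = static_cast<int>(r.size());
    if (n == 0) return;

    // Flatten the nested vector into one contiguous host buffer.
    std::vector<double2> h_r(n);
    for (int i = 0; i < n; ++i) h_r[i] = make_double2(r[i][0], r[i][1]);

    // Allocate device memory and copy host -> device.
    double2* d_r = nullptr;
    const size_t bytes = n * sizeof(double2);
    CUDA_CHECK(cudaMalloc(&d_r, bytes));
    CUDA_CHECK(cudaMemcpy(d_r, h_r.data(), bytes, cudaMemcpyHostToDevice));

    // Launch with enough blocks to cover all n particles.
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    shift_particles<<<blocks, threads>>>(d_r, n, dx, dy);  // pass d_r, not &d_r
    CUDA_CHECK(cudaGetLastError());        // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution

    // Copy device -> host and release the device buffer.
    CUDA_CHECK(cudaMemcpy(h_r.data(), d_r, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_r));

    // Unflatten back into the original nested layout.
    for (int i = 0; i < n; ++i) { r[i][0] = h_r[i].x; r[i][1] = h_r[i].y; }
}
```

With r declared as a pointer, as in the question, the call would be something like shift_on_device(*r, 0.5, -0.25). Compared with the snippet in the question, the sketch sizes the allocation by the number of particles rather than by blocks*threads, passes the device pointer itself to the kernel, and gives cudaMemcpy all four arguments (destination, source, byte count, direction).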