I wrote a C++ application that simulates simple heat flow. It uses OpenCL for the computation. The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n), and produces a new array with the temperatures after each cycle:
Pseudocode:
int t_id = get_global_id(0);
if (t_id < n * n)
{
    m_new[t_id / n][t_id % n] = average of the cell's own temperature and its neighbors' (top, bottom, left, right) temperatures
}
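For concreteness, the per-cell update can also be written as a plain C++ reference that mirrors the kernel's indexing. This is only a sketch: the flat row-major `std::vector<float>` layout, the function name `update_cell`, and the boundary rule (cells on an edge average only over the neighbors that exist) are assumptions, since the pseudocode does not say how edges are handled.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a single kernel invocation: computes the
// new temperature of cell t_id from the old grid m (n x n, row-major).
// Assumption: out-of-range neighbors are skipped, so edge cells average
// over fewer values.
float update_cell(const std::vector<float>& m, std::size_t n, std::size_t t_id)
{
    const std::size_t row = t_id / n;   // same decomposition as the kernel
    const std::size_t col = t_id % n;

    float sum = m[t_id];                // the cell's own temperature
    int count = 1;

    if (row > 0)     { sum += m[t_id - n]; ++count; }  // top
    if (row + 1 < n) { sum += m[t_id + n]; ++count; }  // bottom
    if (col > 0)     { sum += m[t_id - 1]; ++count; }  // left
    if (col + 1 < n) { sum += m[t_id + 1]; ++count; }  // right

    return sum / static_cast<float>(count);
}
```

Note that the function only reads from the old grid, so the result for one cell never depends on whether another cell has already been updated.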
As you can see, every thread computes a single cell of the matrix. When the host application needs to perform X computing cycles, it looks like this:
- For 1 ... X:
  - Copy memory to the OpenCL device
  - Call the kernel
  - Copy memory back
I would like to rewrite the kernel code so that all X cycles are performed without constantly copying memory to and from the OpenCL device:
- Copy memory to the OpenCL device
- Call the kernel X times OR call the kernel once and make it compute X cycles
- Copy memory back
I know that each thread in the kernel should wait until all other threads have finished their work for the current cycle, and that m[][] and m_new[][] should then be swapped. I have no idea how to implement either of these two things.
Or maybe there is another way to do this optimally?
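The "copy once, run X cycles, copy back" plan and the swap of m and m_new can be sketched on the host CPU to make the intended data flow explicit. This is a minimal illustration under assumptions, not OpenCL code: the names `step` and `run_cycles`, the flat row-major `std::vector<float>` layout, and the edge handling (edge cells average over the neighbors that exist) are all hypothetical. Each call to `step` stands in for one kernel launch over n * n work-items.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One full cycle: every cell of `next` is computed from `cur` only.
// Stands in for one kernel launch (assumed edge rule: skip missing neighbors).
void step(const std::vector<float>& cur, std::vector<float>& next, std::size_t n)
{
    for (std::size_t t_id = 0; t_id < n * n; ++t_id) {
        const std::size_t row = t_id / n;
        const std::size_t col = t_id % n;

        float sum = cur[t_id];
        int count = 1;
        if (row > 0)     { sum += cur[t_id - n]; ++count; }  // top
        if (row + 1 < n) { sum += cur[t_id + n]; ++count; }  // bottom
        if (col > 0)     { sum += cur[t_id - 1]; ++count; }  // left
        if (col + 1 < n) { sum += cur[t_id + 1]; ++count; }  // right

        next[t_id] = sum / static_cast<float>(count);
    }
}

// Hypothetical driver: take the grid once, run `cycles` steps, return it once.
std::vector<float> run_cycles(std::vector<float> m, std::size_t n, int cycles)
{
    std::vector<float> m_new(m.size());
    for (int c = 0; c < cycles; ++c) {
        step(m, m_new, n);     // the whole grid is finished before the next step
        std::swap(m, m_new);   // exchanges the two buffers, no element copies
    }
    return m;                  // after the last swap, m holds the newest grid
}
```

In OpenCL terms, one possible mapping of this structure is to keep two device buffers and exchange the two kernel arguments with `clSetKernelArg` before each `clEnqueueNDRangeKernel` call, reading the result back only after the last launch. On an in-order command queue each launch completes before the next one starts, so the launch boundary itself acts as the point where all threads have finished. `barrier()` inside a kernel only synchronizes the work-items of one work-group, which is why a single kernel looping over all X cycles is harder to get right.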