I wrote a C++ application which simulates simple heat flow. It uses OpenCL for the computation. The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n). It returns a new array with the temperatures after each cycle:

pseudocode:

    int t_id = get_global_id(0);
    if(t_id < n * n)
    {
        m_new[t_id / n][t_id % n] = average of its own and its neighbors' (top, bottom, left, right) temperatures
    }

As you can see, every thread computes a single cell of the matrix. When the host application needs to perform X computing cycles, it looks like this:

  • For 1 ... X
    1. Copy memory to OpenCL device
    2. Call kernel
    3. Copy memory back

I would like to rewrite the kernel code to perform all X cycles without constantly copying memory to/from the OpenCL device:

  1. Copy memory to OpenCL device
  2. Call the kernel X times, OR call the kernel once and make it compute X cycles.
  3. Copy memory back
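The buffer swap between cycles doesn't actually require any in-kernel locking if the kernel is enqueued X times: between launches the host just exchanges which buffer is the input and which is the output (with OpenCL, swap the buffers passed to setArg; an in-order queue serializes the launches for you). Here is a minimal CPU sketch of that ping-pong pattern, with a toy run_cycle (my stand-in, not real OpenCL API) in place of one kernel launch:

```cpp
#include <utility>
#include <vector>

// Stand-in for one kernel launch. A toy update (add 1 to every cell) is used
// here so the effect of X cycles is easy to verify; a real kernel would
// average each cell with its neighbors instead.
void run_cycle(const std::vector<float>& in, std::vector<float>& out)
{
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] + 1.0f;
}

// Ping-pong pattern: X cycles with only two buffers and no host<->device copies
// inside the loop.
std::vector<float> simulate(std::vector<float> m, int X)
{
    std::vector<float> m_new(m.size());
    for (int cycle = 0; cycle < X; ++cycle) {
        run_cycle(m, m_new);   // with OpenCL: enqueueNDRangeKernel(...)
        std::swap(m, m_new);   // with OpenCL: swap the two buffers in setArg
    }
    return m;                  // with OpenCL: one enqueueReadBuffer at the end
}
```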

I know that each thread in the kernel should wait until all other threads have done their job, and after that m[][] and m_new[][] should be swapped. I have no idea how to implement either of those two functionalities.

Or maybe there is another way to do this optimally?

elklepo

1 Answer

  1. Copy memory to OpenCL device
  2. Call kernel X times
  3. Copy memory back

This works. Make sure the kernel call is non-blocking (so 1-2 ms per cycle is saved) and that no buffers are created with host-accessible properties such as CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR.

If calling the kernel X times doesn't give satisfactory performance, you can try using a single workgroup (e.g. only 256 threads) that loops X times, with a barrier() at the end of each cycle so all 256 threads synchronize before starting the next cycle. This way you can compute M different heat-flow problems at the same time, where M is the number of compute units (or workgroups); if that machine is a server, it can serve that many computations.
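A CPU analogue of that single-workgroup pattern can be sketched with std::thread standing in for work-items and a hand-rolled barrier standing in for OpenCL's barrier(CLK_GLOBAL_MEM_FENCE). All names here are my own; the toy per-cell update (add 1) stands in for the neighbor averaging:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier: no thread leaves wait() until all n threads have arrived.
class Barrier {
    std::mutex mtx;
    std::condition_variable cv;
    int count, waiting = 0;
    long generation = 0;   // lets the barrier be reused across cycles
public:
    explicit Barrier(int n) : count(n) {}
    void wait() {
        std::unique_lock<std::mutex> lock(mtx);
        long gen = generation;
        if (++waiting == count) {   // last arrival releases everyone
            waiting = 0;
            ++generation;
            cv.notify_all();
        } else {
            cv.wait(lock, [&]{ return gen != generation; });
        }
    }
};

// Each "work-item" owns one cell and loops X cycles, synchronizing after
// every cycle exactly as the single-workgroup kernel would.
void run_workgroup(std::vector<float>& m, int X)
{
    Barrier bar(static_cast<int>(m.size()));
    std::vector<std::thread> threads;
    for (std::size_t id = 0; id < m.size(); ++id) {
        threads.emplace_back([&, id] {
            for (int cycle = 0; cycle < X; ++cycle) {
                m[id] += 1.0f;   // one cycle's work for this cell
                bar.wait();      // barrier(): all threads finish this cycle
            }                    // before any thread starts the next one
        });
    }
    for (auto& t : threads) t.join();
}
```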

Global synchronization across workgroups is not possible because by the time the last threads are launched, the first threads are already gone. A GPU runs only (number of compute units) × (number of threads per workgroup) × (number of wavefronts per workgroup) threads concurrently. For example, an R7-240 GPU with 5 compute units and local-range = 256 can run maybe 5 * 256 * 20 = 25k threads at a time.

Then, for further performance, you can apply local-memory optimizations.

huseyin tugrul buyukisik
  • Can you make one more thing clear for me? When is memory actually copied to the OpenCL device? 1) when creating the buffer: Buffer A(context, CL_MEM_COPY_HOST_PTR, size, mem); 2) when setting the kernel arg: kernel->setArg(0, A); 3) when running the kernel: enqueueNDRangeKernel(...)? I've done some time measuring in my code and it is still not clear to me. – elklepo Nov 21 '16 at 12:20
  • None of them. It is copied after the queue is flushed, once the command starts being issued by the GPU, and just before the enqueueWriteBuffer command is completed, assuming all commands are of the non-blocking type. If enqueueWriteBuffer is set to blocking, then the buffer has finished copying before that enqueueWriteBuffer call returns. clFinish() creates a start signal for the queue so it all begins; clFinish() is also the simplest way to synchronize host and device to be sure the queue is completed. However, an implementation may let a command start immediately and asynchronously, so you may need to put a user event handler at the top. – huseyin tugrul buyukisik Nov 21 '16 at 12:31