
In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel will need to access. What's the recommended way to let each kernel know the location of each of these buffers?

The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.

In CUDA, the way I did this was to create 60+ program-scope global variables in my CUDA code. On the host, I would then write the addresses of the device buffers I allocated into these global variables. Kernels would then simply use these global variables to find the buffers they needed to work with.

What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?

int3h

2 Answers


60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!

However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:

A is 100 elements
B is 200 elements
C is 100 elements

big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]

Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:

A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]

I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. Pointer arithmetic on global pointers is valid in OpenCL C, though, so you could start each of your kernels with some code that adds each offset to the base pointer, giving you a named pointer per array.
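
For example, a kernel could recover named views of the packed buffer at the top of its body. This is a sketch of device code (it needs an OpenCL runtime to actually run; the kernel and buffer names are hypothetical):

```c
// OpenCL C (device code): recover per-array pointers from the packed buffer.
kernel void use_packed(global float *big_array,
                       global const ulong *offsets)
{
    global float *A = big_array + offsets[0];  /* 100 elements */
    global float *B = big_array + offsets[1];  /* 200 elements */
    global float *C = big_array + offsets[2];  /* 100 elements */

    size_t gid = get_global_id(0);
    if (gid < 100)
        C[gid] = A[gid] + B[gid];  /* example use of the named views */
}
```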

On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer (http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html), which would also allow you to pass references to specific arrays without the offsets array.
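
A host-side sketch of carving one named view out of the big buffer (C; `big_buffer` is a placeholder for the packed cl_mem, and it assumes OpenCL 1.1+). One caveat I'd flag: the sub-buffer origin must be a multiple of the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, so you may need to pad each array to that alignment:

```c
/* Carve out B (200 floats starting after A's 100 floats). */
cl_int err;
cl_buffer_region regionB;
regionB.origin = 100 * sizeof(float);
regionB.size   = 200 * sizeof(float);

cl_mem bufB = clCreateSubBuffer(big_buffer,
                                CL_MEM_READ_WRITE,
                                CL_BUFFER_CREATE_TYPE_REGION,
                                &regionB, &err);
/* check err == CL_SUCCESS; bufB can now be passed to
   clSetKernelArg like any other cl_mem. */
```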

I don't think this solution will be any better than passing 60 kernel arguments, but depending on the performance of your OpenCL implementation's clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.

Ryan Marcus
  • The 60 arguments are because this code is being generated by a special code synthesizer for a research project I'm a part of. Unfortunately, I can't control that part. I did end up using the buffer packing methodology you outlined. Hopefully it's a better method than 60 arguments. Thanks for your help! – int3h Jul 10 '12 at 23:35

You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:

kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)

so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
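
Concretely, binding the buffers on the host might look like this sketch (C; `bufs`, `NUM_BUFFERS`, `kernel`, and `queue` are placeholder names):

```c
/* Bind each buffer to its argument index, then enqueue. */
for (cl_uint i = 0; i < NUM_BUFFERS; ++i) {
    cl_int err = clSetKernelArg(kernel, i, sizeof(cl_mem), &bufs[i]);
    /* check err == CL_SUCCESS */
}

size_t global_size = 1024;  /* placeholder work size */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                       NULL, 0, NULL, NULL);
```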

Note that if the kernels will keep using the same buffers with each kernel execution, you only need to set up the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.

James
  • So there's no way around having 60 arguments for each kernel function? I realize I can keep the arguments around, but still, that means each thread starts off using 240 bytes of memory just for pointers, which never change and are the same for all kernels. Plus, there's the potential performance overhead in passing the arguments -- these kernels will be called 60 times/second. – int3h Jun 17 '12 at 07:00