13

Hello: Does Global Work Size (Dimensions) Need to be Multiple of Work Group Size (Dimensions) in OpenCL?

If so, is there a standard way of handling matrices not a multiple of the work group dimensions? I can think of two possibilities:

Dynamically set the size of the work group dimensions to a factor of the global work dimensions. (this would incur the overhead of finding a factor and possibly set the work group to a non-optimal size.)

Increase the dimensions of the global work to be the nearest multiple of the work group dimensions, keeping all input and output buffers the same but checking bounds in the kernel to avoid segfaulting, i.e. do nothing on the work items out of bound of the desired output. (This seems like the better way.)

Would the second way work? Is there a better way? (Or is it not necessary because work group dimensions need not divide global work dimensions?)

Thanks!

Junier
  • 1,622
  • 1
  • 15
  • 21
  • Interesting that you are asking this question, as I was asking it myself. My first guess is that you keep a fixed work group size and handle out of bound by an if branch. – elmattic Jun 30 '10 at 10:05

3 Answers3

7

Thx for the link Chad. But actually, if you read on:

If local_work_size is specified, the values specified in global_work_size[0], … global_work_size[work_dim - 1] must be evenly divisible by the corresponding values specified in local_work_size[0], … local_work_size[work_dim – 1].

So YES, the local work size must be a multiple of the global work size.

I also think the assigning the global work size to the nearest multiple and being careful about bounds should work, I'll post a comment when I get around to trying it.

Junier
  • 1,622
  • 1
  • 15
  • 21
4

This seems to be an old post, but let me update this post with some new information. Hopefully, it could help someone else.

Does Global Work Size (Dimensions) Need to be Multiple of Work Group Size (Dimensions) in OpenCL?

Answer: True till OpenCL 2.0. Before CL2.0, your global work size must be a multiple of local work size, otherwise you will get an error message when you execute clEnqueueNDRangeKernel.

But from CL2.0, this is not required anymore. You can use whatever global work size which fits your application dimensions. However, please remember that the hardware implementation might still use the "old" way, which means padding the global work group size. Therefore, it makes the performance highly dependent on the hardware architecture. You may see quite different performance on different hardware/platforms. Plus, you want to make your application back compatible to support older platform which only supports CL up to version 1.2. So, I think this new feature added in CL2.0 is just for easy programming, to get better controllable performance and backward compatibility, I suggest you still use the following method mentioned by you:

Increase the dimensions of the global work to be the nearest multiple of the work group dimensions, keeping all input and output buffers the same but checking bounds in the kernel to avoid segfaulting, i.e. do nothing on the work items out of bound of the desired output. (This seems like the better way.)

Answer: you are absolutely right. This is the right way to handle such case. Carefully design the local work group size (considering factors such as register usage, cache hit/miss, memory access pattern and so on). And then pad your global work size to a multiple of local work size. Then, you are good to go.

Another thing to consider is that you can utilize the image object to store the data instead of buffer, if there are quite a lot of boundary checking work in your kernel. For image, the boundary check is automatically done by hardware, almost no overhead in most of the implementations. Therefore, padding your global work size, store your data in image object, then, you just need to write your code normally without worrying about the boundary checking.

Robert Wang
  • 1,161
  • 9
  • 15
1

According to the standard it doesn't have to be from what I saw. I think I would handle it with a branch, but I don't know exactly what kind of matrix operation you are doing.

http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf#page=131

global_work_size points to an array of work_dim unsigned values that describe the number of global work-items in work_dim dimensions that will execute the kernel function. The total number of global work-items is computed as global_work_size[0] * ... * global_work_size[work_dim – 1].

The values specified in global_work_size + corresponding values specified in global_work_offset cannot exceed the range given by the sizeof(size_t) for the device on which the kernel execution will be enqueued. The sizeof(size_t) for a device can be determined using CL_DEVICE_ADDRESS_BITS in table 4.3. If, for example, CL_DEVICE_ADDRESS_BITS = 32, i.e. the device uses a 32-bit address space, size_t is a 32-bit unsigned integer and global_work_size values must be in the range 1 .. 2^32 - 1. Values outside this range return a CL_OUT_OF_RESOURCES error.

elmattic
  • 12,046
  • 5
  • 43
  • 79
Chad Brewbaker
  • 2,523
  • 2
  • 19
  • 26