
I've set up a convolution kernel in OpenCL to convolve a 228x228x3 image with 11x11x3x96 weights to produce 55x55x96 filters.

My code works perfectly when I don't specify localWorkSize, but as soon as I do specify it, I start getting errors.

My questions are therefore:

1) How many threads are being launched when I set localWorkSize to NULL? I'm guessing it's implicit but is there any way to get those numbers?

2) How should I allot localWorkSize to avoid errors?


// When localWorkSize is NULL

size_t globalWorkSize[3] = {55, 55, 96};

// Passing NULL for the localWorkSize argument
errNum = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL, globalWorkSize, NULL, 0, NULL, &event);

// WORKS PERFECTLY

// When I set localWorkSize

size_t globalWorkSize[3] = {55, 55, 96};
size_t localWorkSize[3] = {1, 1, 1};

errNum = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, &event);

// ERROR CONTEXT CODE 999

I'm just trying to understand how many threads are created when localWorkSize is NULL and only globalWorkSize is specified.
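For reference, the arithmetic behind the question can be sketched in plain C. The total number of work-items is the product of the globalWorkSize entries, whether or not localWorkSize is given; localWorkSize only decides how those work-items are grouped. The helper names `total_work_items` and `work_group_count` are made up for illustration and are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Total work-items launched: the product of the global sizes.
   This value does not depend on localWorkSize. */
size_t total_work_items(const size_t *global, unsigned dims)
{
    size_t total = 1;
    for (unsigned i = 0; i < dims; ++i)
        total *= global[i];
    return total;
}

/* Number of work-groups for an explicit local size.
   Assumes each local size divides the matching global size evenly. */
size_t work_group_count(const size_t *global, const size_t *local, unsigned dims)
{
    size_t groups = 1;
    for (unsigned i = 0; i < dims; ++i) {
        assert(local[i] > 0 && global[i] % local[i] == 0);
        groups *= global[i] / local[i];
    }
    return groups;
}
```

With globalWorkSize = {55, 55, 96} this gives 290,400 work-items in both cases shown above. When NULL is passed, the implementation picks the grouping itself; the kernel can read the chosen values with get_local_size().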

tera
  • localWorkSize always needs to divide globalWorkSize evenly (0 == globalWorkSize % localWorkSize). Always try to pass a localWorkSize that fits the number of processing elements in a compute unit. Also, make sure you always stay within the boundaries of what is returned by clGetDeviceInfo with `CL_DEVICE_MAX_WORK_ITEM_SIZES` – AlexG Apr 26 '19 at 18:52
  • Yes! This works, but somehow, the time taken is still about the same – Sushanto Praharaj May 01 '19 at 12:08
  • I have no idea what your kernel does, but it seems 55x55x96 is a relatively small problem to solve. If you have multiple read/writes, then using local memory might speed things up a little instead of always accessing global memory. – AlexG May 01 '19 at 17:22
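The divisibility and device-limit advice from the first comment could be turned into a small helper along these lines. This is only a sketch under assumptions: the limits are passed in as plain numbers (in a real program they would come from clGetDeviceInfo with `CL_DEVICE_MAX_WORK_ITEM_SIZES` and `CL_DEVICE_MAX_WORK_GROUP_SIZE`), the function names are hypothetical, and the greedy choice is just one valid option, not necessarily the fastest one.

```c
#include <assert.h>
#include <stddef.h>

/* Largest divisor of n that is <= limit (returns 1 if nothing larger fits). */
size_t largest_divisor_at_most(size_t n, size_t limit)
{
    for (size_t d = (limit < n ? limit : n); d > 1; --d)
        if (n % d == 0)
            return d;
    return 1;
}

/* Greedy pick, one dimension at a time: each local size divides its global
   size, stays within the per-dimension limit, and the product of all local
   sizes stays within max_group_size. The limits are assumed inputs here. */
void pick_local_size(const size_t *global, const size_t *max_item_sizes,
                     size_t max_group_size, unsigned dims, size_t *local)
{
    assert(max_group_size >= 1);
    size_t budget = max_group_size;   /* work-items still available per group */
    for (unsigned i = 0; i < dims; ++i) {
        size_t limit = max_item_sizes[i] < budget ? max_item_sizes[i] : budget;
        local[i] = largest_divisor_at_most(global[i], limit);
        budget /= local[i];           /* keeps the running product in bounds */
    }
}
```

For globalWorkSize = {55, 55, 96} and assumed limits of 256 per dimension and 256 per work-group, this picks {55, 1, 4}, i.e. 220 work-items per group.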

0 Answers