OpenCL bayer to rgb slower than CPU version

Question

I am trying to use OpenCL to speed up Bayer-to-RGB conversion on an i.MX6 platform. I am using this OpenCL shader for the conversion. I expected the OpenCL version to be faster than OpenCV's cvtColor Bayer-to-RGB conversion (which uses a for loop), but it turns out to be much slower than OpenCV's CPU-based conversion.

OpenCV: 28.3 fps for a 1280 x 960 image
OpenCL: 7.15 fps for a 1280 x 960 image

// Global work size: one work item per output pixel.
size_t global[] = {1280, 960};
Mat bayer = Mat(960, 1280, CV_8UC1);
Mat rgb_image = Mat(960, 1280, CV_8UC3);

// Input buffer backed by the host memory of the Bayer image.
cl_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, width * height, bayer.data, &ret);
if (ret != CL_SUCCESS) {
    cout << "OpenCL Buffer Allocation Error\n";
    exit(0);
}

// Output buffer backed by the host memory of the RGB image (3 bytes per pixel).
cl_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, width * height * 3, rgb_image.data, &ret);
if (ret != CL_SUCCESS) {
    cout << "OpenCL OP Buffer Error\n";
    exit(0);
}

while (true) {
    capture_image(bayer);

    // Local work size is NULL, so the implementation picks the work-group size.
    ret = clEnqueueNDRangeKernel(cq, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
    if (ret != CL_SUCCESS) {
        cout << ret << " OpenCL kernel exec Error\n";
        exit(0);
    }

    // Blocking read (CL_TRUE): returns once the output has been copied into rgb_image.data.
    ret = clEnqueueReadBuffer(cq, cl_output, CL_TRUE, 0, width * height * 3, rgb_image.data, 0, NULL, NULL);
    if (ret != CL_SUCCESS) {
        cout << "OpenCL OP Buffer read Error\n";
        exit(0);
    }
}

Is there anything I am missing in the configuration? I am new to OpenCL and am not sure how to call an OpenCL kernel inside a loop. The platform information is as follows:

Platform Name: Vivante OpenCL Platform
Platform Profile: EMBEDDED_PROFILE
Platform Version: OpenCL 1.1
Platform Vendor: Vivante Corporation
Device Name: Vivante OpenCL Device
Device Profile: EMBEDDED_PROFILE
Device Version: OpenCL 1.1
Device Vendor: Vivante Corporation
Device Max Work Item Dimensions: 3-D
Device Max Work Group Size: 1024

PS: I am not able to use the OpenCL support in OpenCV because the i.MX6 doesn't support the OpenCL full profile, which OpenCV's OpenCL support requires.

score 0 · Answer 1 · answered May 15 '17 at 12:14

The only thing that seems a little odd is using a plain buffer rather than an image object. However, I don't expect this to make a massive difference to execution speed. See https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateImage.html

Khouri Giordano · Answer 2 · 2017-05-16T17:14:54.283

Process 4 pixels (the 2x2 Bayer pattern) per thread. This avoids the fmod and the ternary conditionals (c?t:f). You could also try vectorizing this by 4 to produce 8x2 pixels per thread. That puts 4 red pixels in red.xyzw, and so on, with .even and .odd doing the deinterleave when reading and the interleave when writing.
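As a rough illustration of the one-work-item-per-2x2-block idea, here is a minimal CPU-side sketch in C++. It is not the kernel from the question. It assumes an RGGB layout, RGB output byte order, and simple nearest-neighbour reconstruction (each block's single red and blue samples are reused for all four pixels), and the function names are hypothetical. In an actual OpenCL kernel, bx and by would come from get_global_id(0) and get_global_id(1), and the global size would shrink to {width / 2, height / 2}.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: the work one work item would do for the 2x2 block at
// block coordinates (bx, by). ASSUMPTIONS: RGGB layout, RGB output order,
// nearest-neighbour reconstruction (not bilinear interpolation).
inline void demosaic_block_rggb(const std::uint8_t* bayer, std::uint8_t* rgb,
                                std::size_t width, std::size_t bx, std::size_t by)
{
    const std::size_t x = 2 * bx;
    const std::size_t y = 2 * by;

    // Each source sample of the block is read exactly once.
    const std::uint8_t* row0 = bayer + y * width + x;
    const std::uint8_t* row1 = row0 + width;
    const std::uint8_t r  = row0[0];
    const std::uint8_t g0 = row0[1];
    const std::uint8_t g1 = row1[0];
    const std::uint8_t b  = row1[1];

    // The position inside the pattern is fixed, so no modulo or
    // per-pixel conditional is needed.
    std::uint8_t* out0 = rgb + (y * width + x) * 3;
    std::uint8_t* out1 = out0 + width * 3;
    out0[0] = r; out0[1] = g0; out0[2] = b;   // top-left
    out0[3] = r; out0[4] = g0; out0[5] = b;   // top-right
    out1[0] = r; out1[1] = g1; out1[2] = b;   // bottom-left
    out1[3] = r; out1[4] = g1; out1[5] = b;   // bottom-right
}

// CPU stand-in for the NDRange: one call per 2x2 block.
inline std::vector<std::uint8_t> demosaic_rggb(const std::vector<std::uint8_t>& bayer,
                                               std::size_t width, std::size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    assert(bayer.size() == width * height);

    std::vector<std::uint8_t> rgb(width * height * 3);
    for (std::size_t by = 0; by < height / 2; ++by)
        for (std::size_t bx = 0; bx < width / 2; ++bx)
            demosaic_block_rggb(bayer.data(), rgb.data(), width, bx, by);
    return rgb;
}
```

For a 1280 x 960 frame this is 640 x 480 block invocations instead of 1280 x 960 per-pixel ones. A bilinear version would also read the neighbouring blocks, but the block-per-work-item structure stays the same.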

I also see that there is only a single thread per work group. Since you are targeting a single device type, try setting the workgroup size to match the max. That is, 32x32 or 16x64 if you are vectorizing.
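One way to arrive at such a work-group size is sketched below in C++. In OpenCL 1.1 the global size must be evenly divisible by the local size in every dimension, so the hypothetical helper pick_local_size picks power-of-two sizes that divide the global size and whose product stays within the device limit. The names and the max_x cap are assumptions for illustration, not part of any OpenCL API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Largest power of two that is <= limit and divides n evenly (at least 1).
inline std::size_t largest_pow2_divisor(std::size_t n, std::size_t limit)
{
    assert(n > 0 && limit > 0);
    std::size_t best = 1;
    for (std::size_t p = 2; p <= limit && n % p == 0; p *= 2)
        best = p;
    return best;
}

// Hypothetical helper: choose a 2-D local size for a given global size.
// max_wg is the device's maximum work-group size (1024 in the question);
// max_x caps the x dimension (e.g. 32 for 32x32, 16 for 16x64).
inline void pick_local_size(const std::size_t global[2], std::size_t max_wg,
                            std::size_t max_x, std::size_t local[2])
{
    local[0] = largest_pow2_divisor(global[0], std::min(max_x, max_wg));
    // Whatever budget is left over goes to the y dimension.
    local[1] = largest_pow2_divisor(global[1], max_wg / local[0]);
}
```

The resulting local array would be passed as the sixth argument of clEnqueueNDRangeKernel in place of NULL. The limit for a particular kernel can be lower than the device maximum; clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE reports it, and it is worth benchmarking a few sizes rather than assuming the largest is the fastest.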

score 0 · Answer 3 · answered May 22 '17 at 03:01

I don't know about your particular device, but if this were on a discrete GPU there are two optimizations that would make a big difference:

1. Don't re-read data from global memory (in the code you linked to, each source pixel is read many times). Instead, use shared local memory to share values read from global memory between work items in a work group. If you're not up for doing this, use an image instead of global memory. This will leverage the texture cache, which will get you much of the same benefit.
2. Make sure you're using coalesced reads and writes. In general, this means making sure adjacent work items are reading adjacent memory locations.
