How can I accurately measure and compare OpenCL speed for simple for loop function?

Question

I recently tested OpenCL by using a struct to carry and update a C++ class object, with a simple function written as a kernel. To my dismay, the same function ran faster when processed without the kernel, in a plain for loop.

Here is the kernel function :

__kernel void function_x_y_(__global myclass_* input, long n)
{
    int gid = get_global_id(0);
    if (gid < n)
        input[gid].valuez = input[gid].valuey * input[gid].valuex * 8736;
}

Here is the for loop :

for (int i = 0; i < 100; i++) {
    thisclass[i].function_x_y();
}

and the class function :

void function_x_y() {
    valuez = valuex * valuey;
}

I timed both processes:

cout << "Run function in serial\n";
startTime = clock();
for (int i = 0; i < 100; i++) {
    thisclass[i].function_x_y();
}
endTime = clock();
cout << "It took (serial) " << (endTime - startTime) / (CLOCKS_PER_SEC / 1000000) << " ms. " << endl;

cout << "Run function in parallel using struct to write to object\n";
init_ocl();
startTime = clock();
load_kernel_from_struct("function_x_y_", p_struct, 100);  // Loads function and variables into OpenCL
endTime = clock();
cout << "It took (parallel) " << (endTime - startTime) / (CLOCKS_PER_SEC / 1000000) << " ms. " << endl;

With the output:

Run function in serial
It took (serial) 5 ms. 
Run function in parallel using struct to write to object
It took (parallel) 159010 ms.

I am using cl-helper.c by Andreas Klöckner.

I don't understand this; it should be faster. Any help or advice is welcome.

Is there a more accurate speed test? Could the slowdown be due to the time it takes to initialise, allocate memory, and transfer the data to the kernel?

There must be a way to make this run faster. Could it be that I must transfer and initialise everything before running the function?

Thanks, Hbyte.

Possible duplicate of [Measuring execution time of OpenCL kernels](http://stackoverflow.com/questions/23550912/measuring-execution-time-of-opencl-kernels) — Arash, Mar 09 '17 at 05:16
It depends on the number of iterations. For 5000000 iterations: It took (serial) 7133676 ms. Run function in parallel using struct to write to object Kernel Function:functions_.cl :function_x_y_ It took (parallel) 4753831 ms. I am using the function taken from [here](http://stackoverflow.com/questions/42360042/my-opencl-test-does-not-run-much-faster-than-cpu?rq=1). — hbyte, Mar 09 '17 at 12:56
It's like shoveling just a grain of salt and comparing that against the performance of tweezers. For small workloads, the CPU has the best latency. — huseyin tugrul buyukisik, Mar 15 '17 at 12:11

score 1 · Answer 1 · answered Mar 09 '17 at 17:52

The fact that your original test uses only 100 elements ought to be a pretty major clue as to what's happening, not least because of how much the timings changed when you bumped the number of iterations up to 5 million.

C++ compilers are really good at optimizing loops, especially loops with very few iterations (on the order of 10 to 10,000). The compiler may be folding some of your logic into fewer instructions, speeding things up tremendously.

There is unavoidable overhead in OpenCL, caused by:
- The online compilation of the kernel
- The need to transfer data to and from GPU-accessible memory
- The cost of synchronizing the asynchronous Host←→Device architecture

Compute devices get their speed by exploiting hundreds, sometimes even thousands, of cores, so a loop over only 100 elements will perfectly saturate (one core of) a typical CPU but will often occupy only a fraction of a GPU's cores.

One thing I would suggest, incidentally, is to perform your test by only measuring the submission and retrieval of the work data to the GPU, and not the time spent compiling the kernel, since this will more accurately model the comparison between the host code (which has been compiled beforehand, obviously) and the device code.
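As a rough illustration of that suggestion, here is a minimal host-side timing sketch. It assumes a hypothetical `Item` struct standing in for your class, and the helper names `mean_time_us` and `time_serial_us` are made up for this example. It uses `std::chrono::steady_clock` because `clock()` reports processor time on many platforms, which can differ from elapsed time while the host waits on a device. Note also that dividing by `CLOCKS_PER_SEC / 1000000`, as in the question, appears to yield microseconds rather than milliseconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the class in the question.
struct Item {
    long valuex = 0, valuey = 0, valuez = 0;
    void function_x_y() { valuez = valuex * valuey; }
};

// Runs `work` `repeats` times and returns the mean wall-clock time
// per run, in microseconds. Anything that should not be measured
// (context creation, kernel compilation) belongs outside `work`.
template <typename F>
double mean_time_us(F&& work, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < repeats; ++r) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repeats;
}

// Serial baseline: the same loop as in the question, timed over several runs.
double time_serial_us(std::vector<Item>& items, int repeats) {
    return mean_time_us([&items] {
        for (auto& item : items) {
            item.function_x_y();
        }
    }, repeats);
}
```

For the OpenCL side, the same helper could wrap only the per-run work: the buffer write, the kernel enqueue, the buffer read, and a final `clFinish` so the clock stops after the device has actually finished. The call to `clBuildProgram` and the rest of the setup would stay outside the timed lambda.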

And, of course, if you plan to take full advantage of GPGPU devices, you need to make sure the workload is actually large enough to benefit from the parallelism, even in spite of the setup overhead.
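To get a feel for how large "large enough" might be, one option is a simple linear cost model: a fixed setup cost per launch plus a per-item cost on each side. The sketch below is only that model, not a measurement; the `CostModel` struct, its fields, and `break_even_items` are hypothetical names, and the numbers you feed it would have to come from your own timings.

```cpp
#include <cassert>
#include <cmath>

// Assumed linear cost model (illustrative only):
//   device time = setup_us + n * device_us_per_item
//   host time   = n * host_us_per_item
struct CostModel {
    double setup_us;            // fixed overhead per launch (transfers, sync)
    double device_us_per_item;  // marginal cost per element on the device
    double host_us_per_item;    // marginal cost per element on the host
};

// Returns the smallest element count at which the model predicts the
// device to be strictly faster than the host, or -1 if it never is.
long long break_even_items(const CostModel& m) {
    assert(m.setup_us >= 0.0);
    const double gain_per_item = m.host_us_per_item - m.device_us_per_item;
    if (gain_per_item <= 0.0) {
        return -1;  // the device never catches up under this model
    }
    // Device wins when setup_us + n * device < n * host,
    // i.e. when n > setup_us / gain_per_item.
    return static_cast<long long>(std::floor(m.setup_us / gain_per_item)) + 1;
}
```

With a made-up setup cost of 100 µs, 0.5 µs per item on the device, and 1.0 µs per item on the host, the model predicts the device pulls ahead from 201 items onward. Real devices are not this linear, so treat the result as an order-of-magnitude guide.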
