To measure an OpenCL kernel's execution time, we can use either:
1- CPU timers. Because the OpenCL enqueue functions are non-blocking, we need to call clFinish() before stopping the timer so that the measured interval covers the whole kernel execution.
2- GPU timers, that is, clGetEventProfilingInfo() together with the CL_QUEUE_PROFILING_ENABLE flag set in the properties argument of either clCreateCommandQueue() or clSetCommandQueueProperty().
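To make the two approaches concrete, here is a minimal sketch in C. It does not call OpenCL itself: `enqueue_fn` and `sync_fn` are hypothetical callback types standing in for a non-blocking call such as clEnqueueNDRangeKernel() and for clFinish(), and every helper name (`host_now_ms`, `time_with_host_timer_ms`, `profiling_elapsed_ms`, `fake_enqueue`, `fake_finish`) is my own, chosen for illustration. The only OpenCL fact it relies on is that the profiling timestamps are reported in nanoseconds.

```c
#include <stdint.h>
#include <time.h>

/* Host wall-clock time in milliseconds, using C11 timespec_get().
 * Where a monotonic clock is available (for example POSIX
 * clock_gettime(CLOCK_MONOTONIC, ...)) it is the safer choice. */
double host_now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Hypothetical callback types: enqueue_fn stands in for a non-blocking
 * enqueue call, sync_fn for a blocking call such as clFinish(). */
typedef void (*enqueue_fn)(void *ctx);
typedef void (*sync_fn)(void *ctx);

/* Approach 1: host timer around a non-blocking enqueue. */
double time_with_host_timer_ms(enqueue_fn enqueue, sync_fn sync, void *ctx)
{
    sync(ctx);                  /* drain work that was queued earlier   */
    double t0 = host_now_ms();
    enqueue(ctx);               /* returns before the kernel has run    */
    sync(ctx);                  /* block until the kernel has finished  */
    return host_now_ms() - t0;
}

/* Approach 2: device timestamps. start_ns and end_ns are the values read
 * with clGetEventProfilingInfo() for CL_PROFILING_COMMAND_START and
 * CL_PROFILING_COMMAND_END; both are in nanoseconds. */
double profiling_elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1e6;
}

/* Stand-ins used only to exercise the pattern without a device: each one
 * appends a character to a log so the call order can be inspected. */
struct call_log { char calls[8]; int n; };

void fake_enqueue(void *ctx)
{
    struct call_log *log = ctx;
    if (log->n < 7) log->calls[log->n++] = 'E';
}

void fake_finish(void *ctx)
{
    struct call_log *log = ctx;
    if (log->n < 7) log->calls[log->n++] = 'F';
}
```

With the stand-ins, `time_with_host_timer_ms(fake_enqueue, fake_finish, &log)` records the order finish, enqueue, finish. In real host code the same order means clFinish(), start the timer, enqueue, clFinish(), stop the timer. The host-timer figure then also contains the enqueue and driver overhead, whereas the profiling timestamps cover only the interval between command start and command end on the device.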
How can the operating system and the driver version affect the accuracy of the timers used to measure the kernel execution time?
All I know is that we need to warm up the device with at least one kernel call to absorb the latency of the initial OpenCL resource allocation.
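One way to apply the warm-up idea is to time several launches and leave the first ones out of the reported figures. The sketch below assumes the per-launch timings have already been collected into an array of milliseconds; `timing_summary` and `summarize_timings` are hypothetical names, and reporting the minimum and the mean is my own choice rather than anything OpenCL prescribes.

```c
#include <stddef.h>

/* Summary of per-launch timings, in milliseconds. */
struct timing_summary {
    double min_ms;
    double mean_ms;
    size_t used;    /* number of samples that contributed */
};

/* Skip the first `warmup` samples (the launches that pay for the initial
 * resource allocation), then report the minimum and mean of the rest.
 * Returns 0 on success, -1 if no samples remain after the warm-up. */
int summarize_timings(const double *samples_ms, size_t count, size_t warmup,
                      struct timing_summary *out)
{
    if (samples_ms == NULL || out == NULL || warmup >= count)
        return -1;

    double min = samples_ms[warmup];
    double sum = 0.0;
    for (size_t i = warmup; i < count; ++i) {
        if (samples_ms[i] < min)
            min = samples_ms[i];
        sum += samples_ms[i];
    }

    out->used = count - warmup;
    out->min_ms = min;
    out->mean_ms = sum / (double)out->used;
    return 0;
}
```

For example, with the made-up samples {12.0, 2.0, 4.0, 3.0} and one warm-up launch, the first value is ignored and the summary covers the remaining three.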