I have a fairly simple OpenCL kernel that does nothing beyond declaring its input:
__kernel void test(__read_only image3d_t d_multitest){}
My host-side code uses basic pyopencl to transfer an image to the device and run the kernel:
import pyopencl as cl
import numpy as np
platform = cl.get_platforms()[0]
devs = platform.get_devices()
device1 = devs[1]
h_test = np.zeros((64,512,512)).astype(np.float32,order='F')
mf = cl.mem_flags
ctx = cl.Context([device1])
Queue1 = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
Queue2 = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
event_list=[]
with open('Minimal.cl', 'r') as fi:
    fstr = fi.read()
prg = cl.Program(ctx, fstr).build()
knl = prg.test
d_test = cl.Image(ctx,mf.READ_ONLY, cl.ImageFormat(cl.channel_order.INTENSITY, cl.channel_type.FLOAT), h_test.shape)
e1 = cl.enqueue_copy(Queue1, d_test, h_test, is_blocking = False, origin = (0,0,0), region = h_test.shape)
knl.set_args(d_test)
cl.enqueue_nd_range_kernel(Queue2,knl,(512,512,64),None,wait_for=[e1,])
I am profiling this code on different devices and see that the transfer time scales with the memory bandwidth of the device, which is expected. The kernel execution time, on the other hand, varies wildly.
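For reference, the durations come from the events' profiling timestamps, which OpenCL reports in nanoseconds; the queues above were created with PROFILING_ENABLE, so `evt.profile.start` and `evt.profile.end` are available on the returned events. A small conversion helper (the name `profile_ms` is my own, just a sketch):

```python
def profile_ms(start_ns, end_ns):
    """Convert OpenCL event profiling timestamps (nanoseconds) to milliseconds."""
    return (end_ns - start_ns) * 1e-6

# usage with a pyopencl event, e.g. the kernel launch:
# evt = cl.enqueue_nd_range_kernel(Queue2, knl, (512,512,64), None, wait_for=[e1])
# evt.wait()
# print(profile_ms(evt.profile.start, evt.profile.end), "ms")
```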
On Nvidia the kernel execution duration is <<1 ms.
However, the exact same kernel takes >20 ms on AMD's implementation.
My question is whether this kind of overhead is tolerable, or whether I am doing something fundamentally wrong.
Hardware:
NVIDIA GeForce GTX TITAN X
AMD Radeon R9 290X
Host: Ubuntu 16.04