I have a pretty simple OpenCL kernel, basically doing nothing more than defining its input:

__kernel void test(__read_only image3d_t d_multitest){}

My host side code is basic pyopencl to transfer an image to my device and run the kernel:

import pyopencl as cl
import numpy as np

platform = cl.get_platforms()[0]
devs = platform.get_devices()
device1 = devs[1]
h_test = np.zeros((64,512,512)).astype(np.float32,order='F')
mf = cl.mem_flags
ctx = cl.Context([device1])
Queue1 = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
Queue2 = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
event_list=[]

with open('Minimal.cl', 'r') as fi:
    fstr = fi.read()
prg = cl.Program(ctx, fstr).build()
knl = prg.test

d_test = cl.Image(ctx,mf.READ_ONLY, cl.ImageFormat(cl.channel_order.INTENSITY, cl.channel_type.FLOAT), h_test.shape)
e1 = cl.enqueue_copy(Queue1, d_test, h_test, is_blocking = False, origin = (0,0,0), region = h_test.shape)
knl.set_args(d_test)
cl.enqueue_nd_range_kernel(Queue2,knl,(512,512,64),None,wait_for=[e1,])

I am profiling this code on different devices and see that the transfer time basically scales with the memory bandwidth of the device, which is expected. The kernel execution time, on the other hand, varies wildly.
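For reference, the timings were read from the events' profiling counters. A minimal sketch of that (assuming a kernel event `e2` is kept from `enqueue_nd_range_kernel`, and that both queues were created with `PROFILING_ENABLE` as above):

```python
# Sketch: reading OpenCL event profiling counters with pyopencl.
# `profile.start` / `profile.end` are device timestamps in nanoseconds.

def elapsed_ms(event):
    # Convert the nanosecond interval to milliseconds.
    return (event.profile.end - event.profile.start) * 1e-6

# e2 = cl.enqueue_nd_range_kernel(Queue2, knl, (512, 512, 64), None, wait_for=[e1])
# e2.wait()  # profiling info is only valid once the event has completed
# print("copy:   %.3f ms" % elapsed_ms(e1))
# print("kernel: %.3f ms" % elapsed_ms(e2))
```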

On Nvidia the kernel execution duration is <<1 ms. However, the exact same kernel takes >20 ms on AMD's implementation.

My question is whether this kind of overhead is to be expected, or whether I am doing something fundamentally wrong.

Hardware:

NVIDIA GeForce GTX TITAN X

AMD Radeon R9 290X

Host: Ubuntu 16.04

