
I'm using the OpenCL 2.0 device-side enqueue (dynamic parallelism) feature and have each work-item enqueue another kernel with a single work-item. When the child kernels take a long time to complete, the parent kernel finishes before its children, memory consistency is not preserved, and corrupt data (randomly updated data items) is returned.

Since clFinish() and clEnqueueMarkerWithWaitList() are host-side commands for host queues, I can't use them with this default on-device out-of-order queue.

How can I make the child kernels finish before some synchronization point, or at least before a buffer-read command, so that memory consistency is achieved?

Here is the code:

__kernel void test(__global float *xyz, __global float *xyzn,
                   __global float *xyzo, __global float *arguments)
{
    int threadId = get_global_id(0);
    float dx  = xyz[threadId * 3]     - arguments[2];
    float dy  = xyz[threadId * 3 + 1] - arguments[3];
    float t   = arguments[1];
    float ctr = arguments[0];
    float wave = 0.02f * ctr * sin(40.0f * t + 100.0f * sqrt(dx * dx + dy * dy));
    // wave equation for all surface vertices
    xyzo[threadId * 3]     = xyz[threadId * 3]     + xyzn[threadId * 3]     * wave;
    xyzo[threadId * 3 + 1] = xyz[threadId * 3 + 1] + xyzn[threadId * 3 + 1] * wave;
    xyzo[threadId * 3 + 2] = xyz[threadId * 3 + 2] + xyzn[threadId * 3 + 2] * wave;
}

__kernel void waveEquation(__global float *xyz, __global float *xyzn,
                           __global float *xyzo, __global float *arguments)
{
    int threadId = get_global_id(0);
    if (threadId < arguments[4])
    {
        queue_t q = get_default_queue();
        // offset = threadId, 1 work-item, work-group size 1
        ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
        void (^my_block_A)(void) = ^{ test(xyz, xyzn, xyzo, arguments); };
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange, my_block_A);
    }
}

When the parent kernel has only 1-2 work-items, it works fine, but normally there are 256*224 work-items in the parent kernel, and the child kernels cannot complete before the data is accessed from the host (after clFinish()).

Here is the construction of the default queue (different from the queue used for the parent kernel):

commandQueue = cl::CommandQueue(context, device,
   CL_QUEUE_ON_DEVICE|
   CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | 
   CL_QUEUE_ON_DEVICE_DEFAULT, &err);

Edit: creating the queue this way does not make it synchronizable either:

cl_uint qs=device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE>();
cl_queue_properties qprop[] = { CL_QUEUE_SIZE, qs, CL_QUEUE_PROPERTIES, 
     (cl_command_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                                   CL_QUEUE_ON_DEVICE | 
                                   CL_QUEUE_ON_DEVICE_DEFAULT | 
                                   CL_QUEUE_PROFILING_ENABLE), 0 };
device_queue = clCreateCommandQueueWithProperties(context.get(),
                                   device.get(), qprop, &err);

Device: RX550, driver 17.6.2, 64-bit build.


User Parallel Highway's solution also didn't work:

if(threadId<arguments[4])
{
        clk_event_t markerEvent;
        clk_event_t events[1];
        queue_t q = get_default_queue();
        ndrange_t ndrange = ndrange_1D(threadId,1,1);
        void (^my_block_A)(void) = ^{test(xyz,xyzn,xyzo,arguments);};
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT,ndrange,0,NULL,&events[0],my_block_A);
        enqueue_marker(q, 1, events, &markerEvent);

        release_event(events[0]);
        release_event(markerEvent);

}

This didn't work either:

queue_t q = get_default_queue();
ndrange_t ndrange = ndrange_1D(threadId,1,1);
void (^my_block_A)(void) = ^{test(xyz,xyzn,xyzo,arguments);};
int ctr=0;
while((enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT,ndrange,my_block_A)&
        (   CLK_DEVICE_QUEUE_FULL|
            CLK_EVENT_ALLOCATION_FAILURE|
            CLK_OUT_OF_RESOURCES |
            CLK_INVALID_NDRANGE |
            CLK_INVALID_QUEUE |
            CLK_INVALID_EVENT_WAIT_LIST |
            CLK_INVALID_ARG_SIZE
        ))>0 )
{
}

This doesn't work either, but it does complete, so there is no infinite loop.

huseyin tugrul buyukisik

1 Answer

You should consider using enqueue_marker:

https://www.khronos.org/registry/OpenCL/specs/opencl-2.0-openclc.pdf#page=172

There is also an example in the specification where multiple kernels are enqueued, and with an enqueue_marker command you can wait for the child kernels to finish before proceeding with the parent kernel. The sample code is here:

https://www.khronos.org/registry/OpenCL/specs/opencl-2.0-openclc.pdf#page=175
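
For reference, the pattern in that spec example looks roughly like this (a paraphrased sketch, not the spec's verbatim code; `my_block_A`/`my_block_B`/`my_block_C` and the NDRange size are placeholders):

    queue_t q = get_default_queue();
    clk_event_t evt[2];
    clk_event_t marker;

    // launch two independent child kernels, capturing their completion events
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(64), 0, NULL, &evt[0], my_block_A);
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(64), 0, NULL, &evt[1], my_block_B);

    // the marker's event completes only after both children have completed
    enqueue_marker(q, 2, evt, &marker);

    // this child waits on the marker's event, i.e. runs after A and B finish
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(64), 1, &marker, NULL, my_block_C);

    release_event(evt[0]);
    release_event(evt[1]);
    release_event(marker);

Note that the ordering is expressed purely through the event wait list; all three enqueues use CLK_ENQUEUE_FLAGS_NO_WAIT so the children do not wait on the parent itself.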

Edit: After multiple experiments, the findings are as follows: as the number of child kernels a parent kernel launches increases, the program fails. This is probably caused by the queue size, as huseyin tugrul buyukisik suggested. Although the execution does not return an error code, the results are incorrect. There is no mention of this kind of issue in the OpenCL specification.
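
If queue overflow is the suspect, the relevant device limits can at least be queried from the host. A minimal sketch using the C API (assumes `dev` is an already-selected OpenCL 2.0 `cl_device_id`; error checking omitted):

    // Query the on-device queue limits that bound how many child kernels
    // can be pending at once.
    cl_uint pref = 0, max_size = 0, max_events = 0, max_queues = 0;
    clGetDeviceInfo(dev, CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE, sizeof(pref), &pref, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE, sizeof(max_size), &max_size, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_ON_DEVICE_EVENTS, sizeof(max_events), &max_events, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_ON_DEVICE_QUEUES, sizeof(max_queues), &max_queues, NULL);
    printf("queue size: preferred %u bytes, max %u bytes; max events %u; max queues %u\n",
           pref, max_size, max_events, max_queues);

The queue sizes are reported in bytes, so how many enqueue_kernel commands actually fit depends on the implementation's per-command record size, which the specification does not expose.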

parallel highway
  • These are kernel-side events and markers, right? I'll try. – huseyin tugrul buyukisik Jul 02 '17 at 14:51
  • Look at my question's end; it didn't work. Also, this wouldn't work with `CLK_ENQUEUE_FLAGS_WAIT_KERNEL`, since that flag forces child kernels to wait for the parent kernel to complete first while the markers are already waiting for the children, making a deadlock. Not only did it not work, it got stuck computing and I had to ctrl-alt-del. – huseyin tugrul buyukisik Jul 02 '17 at 15:18
  • What happens if the number of all enqueue_kernel calls is greater than the queue size? Do they wait for the queue to get ready, or does that cause this? When I reduce the parent kernel to 224 work-items and have each work-item produce a child kernel with 256 work-items instead of just 1, it works, but I also had to add "-g -D CL_VERSION_2_0" to clBuildProgram() on top of "-cl-std=CL2.0". I don't know why this worked, or whether the queue size is a problem or not. – huseyin tugrul buyukisik Jul 02 '17 at 15:37
  • First, your kernels need to use CLK_ENQUEUE_FLAGS_NO_WAIT to ensure you will not have any deadlocks. Your initial code was also using that flag, so I thought you would use it. Nevertheless, I think the OpenCL specification leaves handling these kinds of deadlocks to the programmer. – parallel highway Jul 02 '17 at 15:55
  • Thank you; what do you think of the queue size? If I call enqueue_kernel 100k times and the queue size is 50k, does this create an invisible error, such as re-using queue slots that are full of kernels waiting to be computed? Or does the device queue work like a pipe and block new enqueue commands until some kernels are computed? Host queues work like a pipe at least: they run async while the host enqueues new commands and block if no space is left. – huseyin tugrul buyukisik Jul 02 '17 at 15:59
  • I was looking for information on what happens if the CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE limit is exceeded, but I couldn't find any, which leads me to think that, depending on the driver implementation, newly created tasks may be dropped. In the meantime, you may try creating that many tasks and enqueuing from the host side to check if that fails as well. If it also fails, I would think it is because of the queue size. – parallel highway Jul 02 '17 at 16:04
  • That explains the randomly updated elements in the data array. So dynamic parallelism is not that fine-grained and is expected to work with groups (it should also be more efficient to have 256 work-items per workgroup, but I was just trying a simple case). – huseyin tugrul buyukisik Jul 02 '17 at 16:07
  • clinfo says: max on-device events = 1024, queue on-device max size = 8388608, max on-device queues = 1. What does 8.4 MB mean? How many kernel enqueue commands fit there? Can the number of parameters set per kernel change it? And if events are bound to a command, does that increase its size too? – huseyin tugrul buyukisik Jul 02 '17 at 16:11
  • Another thing you can check is whether enqueue_marker returns CLK_DEVICE_QUEUE_FULL at some point [link](https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/enqueue_marker.html). The same flag is returned by the enqueue_kernel function as well. – parallel highway Jul 02 '17 at 16:13
  • Where is the info about that `CLK_DEVICE_QUEUE_FULL`? I will try it; I didn't know it existed. Maybe it works on enqueue_kernel too? Then it wouldn't need any markers. – huseyin tugrul buyukisik Jul 02 '17 at 16:14
  • According to this [link](https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/enqueue_kernel.html), the return value of enqueue_kernel can be CLK_DEVICE_QUEUE_FULL. You can check the return value to see if the queue is full. So yes, it would be better to check the enqueue_kernel call's return value. – parallel highway Jul 02 '17 at 16:20
  • `while(enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT,ndrange,my_block_A)==CLK_DEVICE_QUEUE_FULL )` didn't work, but its error must already be hidden by something else, like concurrently overflowing the queue size, if such a thing exists. – huseyin tugrul buyukisik Jul 02 '17 at 16:22
  • Do you mean `!= CLK_DEVICE_QUEUE_FULL`? Otherwise you will only enqueue once. – parallel highway Jul 02 '17 at 16:25
  • No, I meant: try until it enqueues, once per work-item. The parent kernel has 224*256 work-items, which is less than 64k. Maybe each enqueue_kernel implicitly uses an event? Max events is 1024 for the RX550. – huseyin tugrul buyukisik Jul 02 '17 at 16:27
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/148174/discussion-between-huseyin-tugrul-buyukisik-and-parallel-highway). – huseyin tugrul buyukisik Jul 02 '17 at 16:32