3 queues + 1 finish or device-side checkpoints for all queues

Question

Is there a special "wait for event" function that can wait on 3 queues at the same time on the device side, so the host doesn't have to wait for each queue serially?

Is there a checkpoint command that can be sent into a command queue so that it waits for the other command queues to reach the same (vertically aligned) barrier/checkpoint and then continues, all on the device side, with no host-side round-trip?

For now, I tried two different versions:

clWaitForEvents(3, evt_);

and

// Poll each event in turn until its command has completed (CL_COMPLETE == 0)
// or failed (negative status).
for (int i = 0; i < 3; ++i)
{
    cl_int evtStatus = 0;
    clGetEventInfo(evt_[i], CL_EVENT_COMMAND_EXECUTION_STATUS,
        sizeof(cl_int), &evtStatus, NULL);

    while (evtStatus > 0)
    {
        clGetEventInfo(evt_[i], CL_EVENT_COMMAND_EXECUTION_STATUS,
            sizeof(cl_int), &evtStatus, NULL);
        Sleep(0); // give up the rest of the time slice (Win32)
    }
}

The second one is a bit faster (I saw it in someone else's code). Both are executed after 3 flush commands.

Looking at the CodeXL profiler results, the first one waits longer between finish points, and some operations don't even seem to overlap. With the second one, the 3 finish points are all within 3 milliseconds of each other, so it is faster and the longer parts overlap (read + write + compute at the same time).

If there is a way to achieve this with only 1 wait command from the host side, there must be a "flush" version of it too, but I couldn't find one.

Is there any way to achieve the picture below instead of adding flushes between each pipeline step?

queue1 write checkpoint write    checkpoint write
queue2  -               compute  checkpoint compute checkpoint compute
queue3  -                        checkpoint read    checkpoint read

All checkpoints have to be vertically synchronized, and none of these actions may start until a signal is given. For example:

queue1.ndwrite(...);
queue1.ndcheckpoint(...);
queue1.ndwrite(...);
queue1.ndcheckpoint(...);
queue1.ndwrite(...);
queue2.ndrangekernel(...);
queue2.ndcheckpoint(...);
queue2.ndrangekernel(...);
queue2.ndcheckpoint(...);
queue2.ndrangekernel(...);
queue3.ndread(...);
queue3.ndcheckpoint(...);
queue3.ndread(...);
queue3.ndcheckpoint(...);
queue3.ndread(...);

queue1.flush();
queue2.flush();
queue3.flush();

queue1.finish();
queue2.finish();
queue3.finish();

The checkpoints are all handled on the device side, and only 3 finish commands are needed from the host side (even better, only 1 finish for all queues?).

This is how I currently bind the 3 queues to the 3 events passed to "clWaitForEvents(3, evt_);":

hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[2]);

If this "enqueue barrier" can talk to other queues, how could I achieve that? Do I need to keep the host-side events alive until all queues are finished, or can I delete them or re-use them later? From the documentation, it seems that the first barrier's event can be passed to the second queue, and the second barrier's event can be passed to the third queue along with the first one's event, so maybe it is like this:

hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(evt_0, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(evt_0_and_1, &evt[2]);

and in the end wait only for evt[2]? Or maybe use one and the same event for all of them:

hCommandQueue->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[2]);

Where do I get the sameEvt object?

Has anyone tried this? Should I start all queues with a barrier so they don't start until I raise some event from the host side, or is the lazy execution of "enqueue" 100% reliable, so that nothing starts until I flush/finish the queues? How do I raise an event from host to device (sameEvt doesn't have a "raise" function; is it clCreateUserEvent?)?

All 3 queues are in-order and in the same context. Out-of-order queues are not supported by all graphics cards. The C++ bindings are being used.

There are also enqueueWaitList (is this deprecated?) and clEnqueueMarker, but I don't know how to use them, and the documentation on Khronos' website doesn't have any examples.

Alex · Accepted Answer · 2016-09-08T13:13:24.427

You asked too many questions and described too many variants for me to give a single solution, so I will answer in general terms so that you can work out the most suitable solution yourself.

If the queues are bound to the same context (possibly to different devices within that context), then it is possible to synchronize them through events. That is, you can obtain an event from a command submitted to one queue and use this event to synchronize a command submitted to another queue, e.g.

queue1.enqueue(comm1, /*dependency*/ NULL, /*result event*/ &e1);
queue2.enqueue(comm2, /*dependency*/ &e1, /*result event*/ NULL);

In this example, comm2 will wait for comm1 completion.

If you need to enqueue commands first but not allow them to execute yet, you can create a user event (clCreateUserEvent), put it in the commands' wait lists, and signal it manually (clSetUserEventStatus). The implementation is allowed to start processing commands as soon as they are enqueued (the driver is not required to wait for the flush).

The barrier seems overkill for your purpose because it waits for all commands previously submitted to the queue. You can instead use a marker: clEnqueueMarkerWithWaitList (OpenCL 1.2; the older clEnqueueMarker takes no wait list) can wait for a set of events and provides a single event that other commands can depend on.

As far as I know, you can release the event at any moment once you no longer need it. The implementation should prolong the event's lifetime if it needs it for internal purposes.

I do not know what enqueueWaitList is.

Off-topic: if you need non-trivial dependencies between calculations, you may want to consider the TBB flow graph and opencl_node. The opencl_node uses events for synchronization and avoids host-device synchronization where possible. However, it can be tricky to use multiple queues for the same device.

As far as I know, Intel HD Graphics 530 supports out-of-order queues (at least host-side).

Thank you. I used a marker as the first enqueue, then 1 second after flush+finish I set it with clSetUserEventStatus. All queues started after exactly 1 second. I will try the same at the end point to get a single event to wait on (using a marker, but with a device-side event). – huseyin tugrul buyukisik, Sep 08 '16 at 13:57

score 1 · Answer 2 · answered Sep 08 '16 at 22:21


You are making it much harder than it needs to be. On the write queue take an event. Use that as a condition for the compute on the compute queue, and take another event. Use that as a condition on the read on the read queue. There is no reason to force any other synchronization. Note: My interpretation of the spec is that you must clFlush on a queue that you took an event from before using that event as a condition on another queue.


Dithermaster


Tried your way, a simple 1-way event chain on 3 queues. The driver couldn't overlap them, and the smaller kernels ended up running fully serially. It took 140 ms, while 1 (in-order) queue took 127 ms. When I added the same event to another queue (where it isn't needed for that queue's own work), only this 2-way event setup enabled overlapped execution of read+write+compute, resulting in 112 ms. Either CodeXL's timeline graph and my own C# stopwatch timer are both wrong, or the driver is not capable. It fully completes the writes before compute begins. Then compute ends, then the read begins. But in all cases, events have 1-2 ms of latency. – huseyin tugrul buyukisik Sep 10 '16 at 21:08
Maybe the R7-240 is not capable of 3-queue concurrency in hardware (it emulates it in software)? Since compute time is almost non-existent, the timing should have halved (read overlapping write). Maybe it is only capable of compute + read or compute + write? – huseyin tugrul buyukisik Sep 10 '16 at 21:13
You were right, there was no need to make it harder. I just used 4 queues, made the pipelining horizontal instead of vertical, with no events, and it overlaps more efficiently, showing no holes between commands (the extra queue adds some more efficiency too). – huseyin tugrul buyukisik Sep 14 '16 at 00:04
