producer/comsumer model and concurrent kernels

Question

I'm writing a cuda program that can be interpreted as producer/consumer model.

There are two kernels, one produces a data on the device memory,
and the other kernel the produced data.

The number of comsuming threads are set two a multiple of 32 which is the warp size.
and each warp waits utill 32 data have been produced.

I've got some problem here.
If the consumer kernel is loaded later than the producer,
the program doesn't halt.
The program runs indefinately sometimes even though consumer is loaded first.

What I'm asking is that is there a nice implementation model of producer/consumer in CUDA?
Can anybody give me a direction or reference?

here is the skeleton of my code.

**kernel1**:

while LOOP_COUNT
    compute something
    if SOME CONDITION
        atomically increment PRODUCE_COUNT          
        write data into DATA            
atomically increment PRODUCER_DONE

**kernel2**:
while FOREVER
    CURRENT=0
    if FINISHED CONDITION
        return
    if PRODUCER_DONE==TOTAL_PRODUCER && CONSUME_COUNT==PRODUCE_COUNT
        return
    if (MY_WARP+1)*32+(CONSUME_WARPS*32*CURRENT)-1 < PRODUCE_COUNT
        process the data
        if SOME CONDITION
            set FINISHED CONDITION true
        increment CURRENT
    else if PRODUCUER_DONE==TOTAL_PRODUCER
        if currnet*32*CONSUME_WARPS+THREAD_INDEX < PRODUCE_COUNT
            process the data
            if SOME CONDITION
                set FINISHED CONDITION true
            increment CURRENT

CygnusX1 · Answer 1 · 2011-03-04T07:47:44.553

Since you did not provide an actual code, it is hard to check where is the bug. Usually the sceleton is correct, but the problem lies in details.

One of possible issues that I can think of:

By default, in CUDA there is no guarantee that global memory writes by one kernel will be visible by another kernel, with an exception of atomic operations. It can happen then that your first kernel increments PRODUCER_DONE, but there is still no data in DATA.

Fortunately, you are given the intristic function __threadfence() which halts the execution of the current thread, until the data is visible. You should put it before atomically incrementing PRODUCER_DONE. Check out chapter B.5 in the CUDA Programming Guide.

Another issue that may or may not appear:

From the point of view of kernel2, the compiler may deduct that PRODUCE_COUNT, once read, it never changes. The compiler may optimise the code so that, once loaded in register it reuses its value, instead of querying the global memory every time. Solution? Use volatile, or read the value using another atomic operation.

(Edit) Third issue:

I forgot about one more problem. On pre-Fermi cards (GeForce before 400-series) you can run only a single kernel at a time. So, if you schedule the producer to run after the consumer, the system will wait for consumer-kernel to end before producer-kernel starts its execution. If you want both to run at the same time, put both into a single kernel and have an if-branch based on some block index.

producer/comsumer model and concurrent kernels

1 Answers1