
I wrote code that uses many host (OpenMP) threads per GPU. Each thread has its own CUDA stream to order its requests. It looks very similar to the code below:

#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    while (hasJob()) {

        //... code to prepare job - dData, hData, dataSize etc

        cudaError_t streamStatus = cudaStreamQuery(stream);
        if (streamStatus == cudaSuccess) {
             cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
             doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        } else {
             CUDA_CHECK(streamStatus);
        }
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}

And everything was good until I got many small jobs. In that case, from time to time, cudaStreamQuery returned cudaErrorNotReady, which was unexpected to me because I use cudaStreamSynchronize. Until now I thought that cudaStreamQuery would always return cudaSuccess if called after cudaStreamSynchronize. Unfortunately, it appears that cudaStreamSynchronize may finish even while cudaStreamQuery still returns cudaErrorNotReady.

I changed the code to the following, and now everything works correctly.

#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    while (hasJob()) {

        //... code to prepare job - dData, hData, dataSize etc

        cudaError_t streamStatus;
        while ((streamStatus = cudaStreamQuery(stream)) == cudaErrorNotReady) {
             cudaStreamSynchronize(stream);
        }
        if (streamStatus == cudaSuccess) {
             cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
             doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        } else {
             CUDA_CHECK(streamStatus);
        }
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}

So my question is.... is it a bug or a feature?

EDIT: It is similar to the following Java idiom:

synchronized (this) {
    while(waitCondition) {
         wait();
    }
}
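For completeness, the busy-wait loop can also be avoided with a per-stream event: record an event after the last operation of a job and wait on that event. This is only a sketch, reusing the names from the code above (hasJob, doTheJob, dData, hData, dataSize, gridDim, blockDim, smSize), which are assumed to be defined as in the question:

```cuda
// Sketch only: the same per-thread loop using a CUDA event instead of
// polling cudaStreamQuery. All names are assumed from the code above.
cudaStream_t stream;
cudaEvent_t  jobDone;
cudaStreamCreate(&stream);
cudaEventCreate(&jobDone);

while (hasJob()) {
    //... code to prepare job - dData, hData, dataSize etc

    cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
    doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
    cudaEventRecord(jobDone, stream);   // mark the end of this job
    cudaEventSynchronize(jobDone);      // block until this job has finished
}

cudaEventDestroy(jobDone);
cudaStreamDestroy(stream);
```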
  • Given that this is a bug vs. feature question you might be better off asking on the NVIDIA forums where someone from their product team can clarify it. – Ade Miller Mar 08 '11 at 14:58
  • @Ade: I did that - http://forums.nvidia.com/index.php?showtopic=194982 – kokosing Mar 08 '11 at 15:01
  • Are you sure you have initialized cuda correctly in each thread? Do you pass the correct stream to `cudaStreamSynchronize` (no parameter in your code)? On the other hand the stream does not have to be "ready" to schedule more memory transfers and kernel executions to it. That's why it's called a stream... – Jonas Bötel Mar 08 '11 at 17:33
  • You were right, I added `stream` as a parameter to `cudaStreamSynchronize`. As I wrote before, I am using 4.0 RC, and as far as I know no special care is needed to initialize each host thread. I know that I can schedule as many jobs to a stream as I want, even while the stream is already executing, but I want to know when a particular stream has finished its jobs. – kokosing Mar 09 '11 at 09:51

2 Answers


What is under

//... code to prepare job - dData, hData, dataSize etc

Do you have any functions of the cudaMemcpyAsync kind there, or is the only memory transfer the one in the code you have shown? Asynchronous functions may return early, even when the data has not reached its destination yet. When that happens, cudaStreamQuery will return cudaSuccess only once the memory transfers have actually completed.

Also, does hasJob() use any of the host-CUDA functions?

If I am not mistaken, within a single stream a kernel and a memory transfer cannot execute concurrently; operations in a stream are serialized. Therefore, calling cudaStreamQuery is necessary only when a kernel depends on data transferred by a different stream.
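If you do have such a cross-stream dependency, it can be expressed on the device side with cudaStreamWaitEvent instead of host-side polling. A sketch, where copyStream and computeStream are hypothetical stream names and the rest comes from the question:

```cuda
// Sketch only: make computeStream wait for a transfer issued on copyStream.
// copyStream and computeStream are hypothetical; dData, hData, dataSize,
// doTheJob, gridDim, blockDim, smSize are assumed from the question.
cudaEvent_t copyDone;
cudaEventCreate(&copyDone);

cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, copyStream);
cudaEventRecord(copyDone, copyStream);

// Everything submitted to computeStream after this call waits for copyDone,
// without blocking the host thread.
cudaStreamWaitEvent(computeStream, copyDone, 0);
doTheJob<<<gridDim, blockDim, smSize, computeStream>>>(dData, dataSize);

cudaEventDestroy(copyDone);
```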

CygnusX1
  • As you said, those are asynchronous functions, so I wonder whether the mentioned behavior is a normal situation that may happen from time to time, i.e. a feature, and they may exit earlier than expected. All CUDA function invocations were listed; hasJob() and the code to prepare a job are plain C++. Here all streams are independent; I just want to know which of them has finished its work so I can schedule a new job to it, dynamically and efficiently. – kokosing Mar 09 '11 at 10:01

Didn't notice it earlier: cudaStreamSynchronize() should take a parameter (the stream). I am not sure which stream you are synchronizing when the parameter is omitted; it could be that it defaults to stream 0.

CygnusX1
  • You are right, I omitted it by accident. I fixed the question. – kokosing Mar 09 '11 at 10:26
  • Did it solve your problem, or was that just an error in the question code and not in your real source code? The reason I am asking is that I believe you should not experience what you describe and I am thinking about possible errors in the code before blaming NVIDIA (although they do have a lot of bugs). – CygnusX1 Mar 09 '11 at 10:30
  • Nope. It was only here in the question. My actual code had the proper `cudaStreamSynchronize` with `stream` in it. – kokosing Mar 09 '11 at 13:11