I use CUDA 6.5 and 4 Kepler GPUs.
I use multithreading with the CUDA runtime API and access the CUDA contexts from different CPU threads (using OpenMP, but that does not really matter).
When I call
cudaDeviceSynchronize();
will it wait only for the kernel(s) launched in the current CUDA context, i.e. the one selected by the latest call to cudaSetDevice(), or in all CUDA contexts? And if it waits for kernel(s) in all CUDA contexts, does that mean all CUDA contexts used by the current CPU thread (for example, CPU thread_0 would wait for GPUs 0 and 1), or all CUDA contexts in general (CPU thread_0 would wait for GPUs 0, 1, 2 and 3)?
Here is the code:
// Using OpenMP requires the compiler flags:
// MSVS option: -Xcompiler "/openmp"
// GCC option:  -Xcompiler -fopenmp
#include <omp.h>
#include <cuda_runtime.h>

int main() {
    // Execute two CPU threads: omp_get_thread_num() = 0 and 1
    #pragma omp parallel num_threads(2)
    {
        int omp_threadId = omp_get_thread_num();

        // CPU thread 0
        if (omp_threadId == 0) {
            cudaSetDevice(0);
            kernel_0<<<...>>>(...);
            cudaSetDevice(1);
            kernel_1<<<...>>>(...);
            cudaDeviceSynchronize(); // which kernel(s) will this wait for?

        // CPU thread 1
        } else if (omp_threadId == 1) {
            cudaSetDevice(2);
            kernel_2<<<...>>>(...);
            cudaSetDevice(3);
            kernel_3<<<...>>>(...);
            cudaDeviceSynchronize(); // which kernel(s) will this wait for?
        }
    }
    return 0;
}
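In case cudaDeviceSynchronize() only waits for the currently selected device, would the correct pattern be to re-select each device this thread launched work on and synchronize it explicitly? A minimal sketch of that fallback is below; dummyKernel and launchAndSyncPerDevice are placeholder names of mine, standing in for the real kernels and launch configurations above:

// Sketch: explicitly synchronize every device used by this CPU thread,
// assuming cudaDeviceSynchronize() only affects the device selected by cudaSetDevice().
#include <cuda_runtime.h>

__global__ void dummyKernel() { } // placeholder for kernel_0 / kernel_1 / ...

void launchAndSyncPerDevice(const int* devices, int numDevices) {
    // Launch one kernel on each device used by this CPU thread.
    for (int i = 0; i < numDevices; ++i) {
        cudaSetDevice(devices[i]);
        dummyKernel<<<1, 1>>>();
    }
    // Re-select and synchronize each device so no kernel on another device is missed.
    for (int i = 0; i < numDevices; ++i) {
        cudaSetDevice(devices[i]);
        cudaDeviceSynchronize();
    }
}

int main() {
    const int devicesThread0[] = { 0, 1 }; // devices used by CPU thread 0 in the question
    launchAndSyncPerDevice(devicesThread0, 2);
    return 0;
}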