
I'm facing some issues when trying to overlap computation and transfers on Windows (using VS2015 and CUDA 10.1). The code doesn't overlap at all, but the exact same code on Linux shows the expected behaviour.

Here are the views from NVVP:

Windows 10 NVVP screenshot :

NVVP screenshot on Win10/VS2015

Linux NVVP screenshot :

NVVP screenshot on Linux

Please note the following points:

  • my host memory is page-locked
  • I'm using two different streams
  • I'm using the cudaMemcpyAsync method to transfer between host and device
  • if I run my code on Linux, everything is fine
  • I don't see anything in the documentation describing a different behaviour between these two systems.

So the question is the following:

Am I missing something? Is there a way to achieve overlapping on this configuration (Windows 10 + 1080 Ti)?


You can find some code here to reproduce the issue:

#include "cuda_runtime.h"

constexpr int NB_ELEMS = 64*1024*1024;
constexpr int BUF_SIZE = NB_ELEMS * sizeof(float);

constexpr int BLK_SIZE=1024;

using namespace std;

__global__
void dummy_operation(float* ptr1, float* ptr2)
{
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx<NB_ELEMS)
    {
        float value = ptr1[idx];
        for(int i=0; i<100; ++i)
        {
            value += 1.0f;
        }

        ptr2[idx] = value;
    }
}


int main()
{
    float *h_data1 = nullptr, *h_data2 = nullptr,
        *h_data3 = nullptr, *h_data4 = nullptr;
    cudaMallocHost(&h_data1, BUF_SIZE);
    cudaMallocHost(&h_data2, BUF_SIZE);
    cudaMallocHost(&h_data3, BUF_SIZE);
    cudaMallocHost(&h_data4, BUF_SIZE);

    float *d_data1 = nullptr, *d_data2 = nullptr,
        *d_data3 = nullptr, *d_data4 = nullptr;

    cudaMalloc(&d_data1, BUF_SIZE);
    cudaMalloc(&d_data2, BUF_SIZE);
    cudaMalloc(&d_data3, BUF_SIZE);
    cudaMalloc(&d_data4, BUF_SIZE);

    cudaStream_t st1, st2;
    cudaStreamCreate(&st1);
    cudaStreamCreate(&st2);

    const dim3 threads(BLK_SIZE);
    const dim3 blocks(NB_ELEMS / BLK_SIZE + 1);

    for(int i=0; i<10; ++i)
    {
        float* tmp_dev_ptr = (i%2)==0? d_data1 : d_data3;
        float* tmp_host_ptr = (i%2)==0? h_data1 : h_data3;
        cudaStream_t tmp_st = (i%2)==0? st1 : st2;
        cudaMemcpyAsync(tmp_dev_ptr, tmp_host_ptr, BUF_SIZE, cudaMemcpyDeviceToHost, tmp_st);
        dummy_operation<<<blocks, threads, 0, tmp_st>>>(tmp_dev_ptr, d_data2);
        //cudaMemcpyAsync(d_data2, h_data2);
    }

    cudaStreamSynchronize(st1);
    cudaStreamSynchronize(st2);

    return 0;
}
talonmies
X3liF
  • This is probably related to the behaviour of the WDDM driver on Windows compared to Linux. There might be no solution except for using a supported GPU in TCC mode – talonmies Mar 09 '20 at 12:47
  • The code and the profiler output don't match. You are calling `cudaMemcpyDeviceToHost` but in the profiler output, it is H2D. Are you sure this code produces these profiler results? – heapoverflow Mar 09 '20 at 12:54
  • @AkifÇördük you are right; since the memory is page-locked, it looks like the copy direction is determined by the pointers rather than the flag. I tried cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, and cudaMemcpyDeviceToDevice, and each performs a copy from host to device (with respect to the given pointers). – X3liF Mar 09 '20 at 13:15
  • @talonmies thank you for pointing this out, I'll try to find a Quadro card to confirm. – X3liF Mar 09 '20 at 13:15

1 Answer


As pointed out by @talonmies, to overlap compute and transfers you need a graphics card running in TCC (Tesla Compute Cluster) driver mode.

I've verified this behaviour using an old Quadro P620.

[Edit] Overlapping between kernels and copies seems to be working since I applied the Windows 10 1909 update.

I'm not sure whether the Windows update included a graphics driver update or not, but it's fine :)
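
For anyone wanting to check which driver model their card is running, here is a minimal sketch (it assumes the CUDA toolkit is installed; `tccDriver` is a field of `cudaDeviceProp` that is non-zero when the TCC driver is in use):

```cpp
#include "cuda_runtime.h"
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // tccDriver is non-zero when the device runs the TCC driver,
    // and zero under WDDM (where command batching can hide overlap).
    printf("%s: %s driver\n", prop.name, prop.tccDriver ? "TCC" : "WDDM");
    return 0;
}
```

The same information is also reported by `nvidia-smi` in its per-GPU summary.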

X3liF
  • It isn't necessarily that you need a TCC card to get overlap, but the WDDM driver does a lot of command batching and stuff which might defeat the natural flow of streamed operations and cause overlap to be lost – talonmies Mar 09 '20 at 16:17
  • Is there a way to disable all this stuff using registry keys? – X3liF Mar 09 '20 at 16:34
  • No idea, I am not a habitual WDDM user, but I very much doubt it – talonmies Mar 09 '20 at 16:47
  • You can experiment with inserting [`cudaStreamQuery(stream)`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g2021adeb17905c7ec2a3c1bf125c5435) after the kernel launch, to stop the driver batching up kernels and launch the current batch immediately. – tera Mar 10 '20 at 10:38
  • I vaguely remember this is not the recommended course of action anymore, but I can't recall what the suggested replacement is. If anyone remembers, please post a link here. – tera Mar 10 '20 at 10:39
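
For reference, the workaround tera describes would look something like this applied to the loop from the question (a sketch only; whether it actually forces the WDDM driver to submit its command batch may depend on the driver version, and the copy kind is corrected to `cudaMemcpyHostToDevice` as discussed in the comments above):

```cpp
for(int i = 0; i < 10; ++i)
{
    float* tmp_dev_ptr  = (i % 2) == 0 ? d_data1 : d_data3;
    float* tmp_host_ptr = (i % 2) == 0 ? h_data1 : h_data3;
    cudaStream_t tmp_st = (i % 2) == 0 ? st1 : st2;

    cudaMemcpyAsync(tmp_dev_ptr, tmp_host_ptr, BUF_SIZE,
                    cudaMemcpyHostToDevice, tmp_st);
    dummy_operation<<<blocks, threads, 0, tmp_st>>>(tmp_dev_ptr, d_data2);

    // Nudge the WDDM driver to launch the batched commands now
    // instead of waiting to accumulate more work.
    cudaStreamQuery(tmp_st);
}
```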