I'm testing CUDAfy with a small gravity simulation and after running a profiler on the code I see that most of the time is spent on the CopyFromDevice method of the GPU. Here's the code:

    private void WithGPU(float dt)
    {
        this.myGpu.CopyToDevice(this.myBodies, this.myGpuBodies);
        this.myGpu.Launch(1024, 1, "MoveBodies", -1, dt, this.myGpuBodies);
        this.myGpu.CopyFromDevice(this.myGpuBodies, this.myBodies);
    }

Just to clarify, this.myBodies is an array with 10,000 structs like the following:

    [Cudafy(eCudafyType.Struct)]
    [StructLayout(LayoutKind.Sequential)]
    internal struct Body
    {
        public float Mass;

        public Vector Position;

        public Vector Speed;
    }

And Vector is a struct with two floats X and Y.

According to my profiler, the average timings for those three lines are 0.092, 0.192 and 222.873 ms. These timings were taken on Windows 7 with an NVIDIA NVS 310.

Is there a way to improve the time of the CopyFromDevice() method?

Thank you

Julio César
  • Perhaps it's taking 222.873 ms to actually perform the processing? CopyFromDevice would need to wait for the processing to complete before it can do the copying. – Darren Gourley Nov 12 '15 at 12:26
  • How could I discern if that is the case? – Julio César Nov 12 '15 at 12:31
  • Good question. Honestly I don't know. I spent a bit of time using CUDA for a problem I was having; I ended up scrapping the concept as it was more hassle than it was worth and actually took longer to produce a result for my particular problem. From what I can remember, you're defining the size of the "grid" as 1024, but the block is only set to 1. I think this essentially means you're only using one thread per block on the GPU. Don't quote me on that - I just don't have time to look up the documentation at the minute. They do have a useful CUDAfy.NET tutorial with code examples that might be helpful – Darren Gourley Nov 12 '15 at 12:38
    The first comment by Darren is likely the case. Try adding `this.myGpu.Synchronize();` right after your kernel launch (`this.myGpu.Launch...`). This will act as a barrier and wait there for the kernel to complete before allowing the host thread to continue. So it will "absorb" all the CUDA processing time in the kernel, and the remaining `CopyFromDevice` operation should then shrink down to an appropriate size in the profiler. – Robert Crovella Nov 12 '15 at 14:39
  • Hi Robert. Yes, the time was spent in processing. After adding the Synchronize() call, the time of CopyFromDevice was reduced to a saner amount. Please add your comment as an answer so I can accept it. – Julio César Nov 12 '15 at 18:19

1 Answer

CUDA kernel launches are asynchronous. This means that immediately after launching the kernel, the CPU thread is released and continues with the code that follows the launch, while the kernel is still executing on the GPU.

If the subsequent code contains any sort of CUDA execution barrier, then the CPU thread will then stop at the barrier until the kernel execution is complete. In CUDA, both cudaMemcpy (the operation underlying the cudafy CopyFromDevice method) and cudaDeviceSynchronize (the operation underlying the cudafy Synchronize method) contain execution barriers.

Therefore, from a host code perspective, such a barrier immediately following a kernel launch will appear to halt CPU thread execution for the duration of the kernel execution.

For this reason, the particular barrier in this example will include both the kernel execution time, as well as the data copy time. You can use the Synchronize barrier method immediately after the kernel launch to disambiguate the timing indicated by profiling the host code.
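A minimal sketch of that change, using the field and method names from the `WithGPU` method in the question (the `Synchronize` call is the only addition):

    private void WithGPU(float dt)
    {
        this.myGpu.CopyToDevice(this.myBodies, this.myGpuBodies);

        // Launch returns immediately; the kernel runs asynchronously on the GPU.
        this.myGpu.Launch(1024, 1, "MoveBodies", -1, dt, this.myGpuBodies);

        // Barrier: block the host thread until the kernel finishes, so the
        // kernel execution time is attributed to this call in the profiler
        // rather than to the copy below.
        this.myGpu.Synchronize();

        // Now this line measures only the device-to-host transfer.
        this.myGpu.CopyFromDevice(this.myGpuBodies, this.myBodies);
    }

With the barrier in place, a host-side profiler should show `CopyFromDevice` taking only the actual transfer time for the 10,000 structs, and the bulk of the 222 ms moving to the `Synchronize` line.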

Robert Crovella