
I want to measure the overhead of a kernel launch in CUDA.

I understand that there are various parameters which affect this overhead. I am interested in the following:

  • number of threads created
  • size of data being copied

I am doing this mainly to measure the advantage of using managed memory, which was introduced in CUDA 6.0. I will update this question with the code I develop based on the comments. Thanks!

pranith
  • You may wish to take a look at Section 6.1.1 of the CUDA Handbook by N. Wilt. – Vitality Jun 24 '14 at 05:07
  • 2
    The source code in the concurrency directory for The CUDA Handbook has apps that measure kernel launch overhead. I have not tested the implications of managed memory but heard at GTC that the driver copies all managed memory back to host memory on synchronization. So synchronous kernel launch overhead should increase as you allocate more managed memory. https://github.com/ArchaeaSoftware/cudahandbook/blob/master/concurrency/nullKernelSync.cu – ArchaeaSoftware Jun 24 '14 at 13:37

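Following up on ArchaeaSoftware's comment, here is a minimal sketch of how one might check whether synchronous launch overhead grows with the amount of managed memory allocated. It assumes CUDA 6.0+ and a GPU that supports unified memory; the buffer sizes and iteration count below are arbitrary.

#include <stdio.h>

__global__ void EmptyKernel() { }

// Average time (ms) of a synchronous empty-kernel launch while `bytes`
// of managed memory are allocated.
float timedLaunch(size_t bytes, int N) {

    char *managed = NULL;
    if (bytes > 0) cudaMallocManaged((void **)&managed, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < N; i++) {
        EmptyKernel<<<1,1>>>();
        cudaDeviceSynchronize();     // force each launch to complete before the next one
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (managed) cudaFree(managed);

    return ms / N;
}

int main() {

    const int N = 1000;
    size_t sizes[] = { 0, 1 << 20, 16 << 20, 256 << 20 };   // 0 B, 1 MB, 16 MB, 256 MB

    for (int i = 0; i < 4; i++)
        printf("%10zu bytes managed: %3.5f ms per synchronous launch\n",
               sizes[i], timedLaunch(sizes[i], N));

    return 0;
}
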
2 Answers


How to measure the overhead of a kernel launch in CUDA is dealt with in Section 6.1.1 of "The CUDA Handbook" by N. Wilt. The basic idea is to time the launch of an empty kernel. Here is a sample code snippet:

#include <stdio.h>

// Kernel with an empty body: any measured time is pure launch overhead
__global__ void EmptyKernel() { }

int main() {

    const int N = 100000;

    float time, cumulative_time = 0.f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i=0; i<N; i++) { 

        // Bracket a single launch with events and accumulate the elapsed time
        cudaEventRecord(start, 0);
        EmptyKernel<<<1,1>>>(); 
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        cumulative_time = cumulative_time + time;

    }

    // Report the average over N launches
    printf("Kernel launch overhead time:  %3.5f ms \n", cumulative_time / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 0;
}

On my laptop's GeForce GT540M card, the kernel launch overhead is about 0.00245 ms.

If you want to check the dependence of this time on the number of threads launched, then just change the kernel launch configuration <<<*,*>>>. It appears that the timing does not significantly change with the number of threads launched, which is consistent with the book's statement that most of that time is spent in the driver.
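
For instance, a minimal sketch of such a sweep, reusing the empty kernel and event timing above with a few arbitrary block/thread configurations (1024 threads per block assumes compute capability 2.0 or higher):

#include <stdio.h>

__global__ void EmptyKernel() { }

int main() {

    const int N = 100000;

    // A few arbitrary launch configurations: { blocks, threads per block }
    int configs[][2] = { {1, 1}, {1, 256}, {1, 1024}, {32, 256}, {256, 256} };
    const int numConfigs = sizeof(configs) / sizeof(configs[0]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int c = 0; c < numConfigs; c++) {

        int blocks  = configs[c][0];
        int threads = configs[c][1];
        float time, cumulative_time = 0.f;

        for (int i = 0; i < N; i++) {
            cudaEventRecord(start, 0);
            EmptyKernel<<<blocks, threads>>>();
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            cudaEventElapsedTime(&time, start, stop);
            cumulative_time += time;
        }

        printf("<<<%d,%d>>> : %3.5f ms per launch\n",
               blocks, threads, cumulative_time / N);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}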

Vitality
  • Yes, this answers the overhead for number of threads part. I will try and implement varying size of memory. Thank you! – pranith Jun 30 '14 at 18:45

You may also be interested in these test results from the University of Virginia:

Memory transfer overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html

Kernel launch overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/kernel_overhead.html

They were measured in a similar way to JackOLantern's proposal above.
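
For the transfer-overhead side, here is a minimal sketch in the same event-based style that times host-to-device copies of a few arbitrary sizes from pageable host memory (pinned memory and device-to-host copies can be timed analogously):

#include <stdio.h>
#include <stdlib.h>

int main() {

    const int N = 100;                                         // repetitions per size
    size_t sizes[] = { 1 << 10, 1 << 16, 1 << 20, 1 << 26 };   // 1 KB .. 64 MB

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int s = 0; s < 4; s++) {

        size_t bytes = sizes[s];
        char *h_buf = (char *)malloc(bytes);                   // pageable host buffer
        char *d_buf = NULL;
        cudaMalloc((void **)&d_buf, bytes);

        float time, cumulative_time = 0.f;
        for (int i = 0; i < N; i++) {
            cudaEventRecord(start, 0);
            cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            cudaEventElapsedTime(&time, start, stop);
            cumulative_time += time;
        }

        printf("%10zu bytes: %3.5f ms per host-to-device copy\n",
               bytes, cumulative_time / N);

        cudaFree(d_buf);
        free(h_buf);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}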

Alejandro Silvestri