
I ran this simple cuFFT code in two IDEs:

  1. VS 2013 with CUDA 7.0
  2. VS 2010 with CUDA 4.2

I found that VS 2013 with CUDA 7.0 was roughly 1000 times slower: the code executed in about 0.6 ms on VS 2010 but took about 520 ms on VS 2013, averaged over several runs.

#include "stdafx.h"
#include "cuda.h"
#include "cuda_runtime_api.h"
#include "cufft.h"
typedef cuComplex Complex;
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int SIZE = 10000;
    Complex *h_col = (Complex*)malloc(SIZE*sizeof(Complex));
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    Complex *d_col;
    cudaMalloc((void**)&d_col, SIZE*sizeof(Complex));
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    const int BATCH = 1;
    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds;

    return 0;
}

The code was run on the same computer, with the same OS and the same graphics card, immediately one after the other. The configuration in both cases was x64 Release. In Visual Studio you can choose whether to compile the file with the C++ compiler or as CUDA C/C++; I tried both options in both projects and it made no difference.

Any ideas to fix this?

FWIW, I get the same results with CUDA 6.5 on VS 2013 as with CUDA 7.0.

The Vivandiere
  • There's an overhead associated with creating a CUDA context, which I believe is what you're measuring (I'm not certain, as I don't have access to a compiler at the moment). If you put a `cudaFree(0);` line before you start timing to initialize the context, does it behave as expected? – Jez Jun 23 '15 at 20:48
  • @Jez, I added `cudaFree(0)` right before `cudaEventRecord(start);` and it did not change my timing. Still getting `500 ms` approx – The Vivandiere Jun 23 '15 at 20:50
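For reference, the warm-up idea from the comments looks like the sketch below. Note the distinction it relies on: `cudaFree(0)` forces creation of the CUDA *context*, but cuFFT performs its own one-time initialization on first use, so a throwaway plan may also be needed before the timed region (fragment only; error checking omitted):

```
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    // Force CUDA context creation before timing begins.
    cudaFree(0);

    // cuFFT has its own one-time initialization beyond the CUDA context;
    // creating and destroying a throwaway plan pays that cost up front too.
    cufftHandle warmup;
    cufftPlan1d(&warmup, 16, CUFFT_C2C, 1);
    cufftDestroy(warmup);

    // ... cudaEventRecord(start) and the real, timed work go here ...
    return 0;
}
```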

1 Answer


The cuFFT library has grown considerably between CUDA 4.2 and 7.0, and that results in substantially more initialization time. If you remove this initialization time as a factor, I think you will find there is far less than a 1000x difference in execution time.

Here's a modified code demonstrating this:

$ cat t807.cu
#include <cufft.h>
#include <cuComplex.h>
typedef cuComplex Complex;
#include <iostream>
using namespace std;
int main(int argc, char* argv[])
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int SIZE = 10000;
    Complex *h_col = (Complex*)malloc(SIZE*sizeof(Complex));
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    Complex *d_col;
    cudaMalloc((void**)&d_col, SIZE*sizeof(Complex));
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    const int BATCH = 1;
    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds << endl;

    cudaEventRecord(start);
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds << endl;

    return 0;
}
$ nvcc -o t807 t807.cu -lcufft
$ ./t807
94.8298
1.44778
$

The second number above represents essentially the same code with the cufft initialization removed (since it was done on the first pass).

Robert Crovella
  • Awesome! Thanks Mr. Crovella. That indeed makes it a lot better. It is still approx 10-20x slower when compared to Cuda 4.2. Is that to be expected? – The Vivandiere Jun 23 '15 at 21:04
  • Also, I see that your code executed 5x faster than mine, can you please tell me how so? I have GTX Titan, and I do not know of any other graphics card 5x faster than Titan. – The Vivandiere Jun 23 '15 at 21:05
  • 1
    Accurate (perhaps I should say *understandable*) benchmarking is sometimes quite tricky, and is something like peeling back layers. The first big layer was cufft initialization time. The next step is probably to understand the contributions of each line of code in your timing region -- I suspect that the duration of the actual `cufftExecC2C` call hasn't changed much. Regarding the difference between your timing and mine, my guess is the platform difference (windows vs. linux) is the crux of the issue. Is your GTX Titan hosting a display? WDDM can make benchmarking quite tricky. – Robert Crovella Jun 23 '15 at 21:13
  • No, GTX Titan is not driving my displays. The displays are connected to on-board graphics. That leaves only Windows vs Linux. The 5x difference is phenomenal. – The Vivandiere Jun 23 '15 at 21:39
  • 2
    Rather than going through the laborious process of instrumenting each line of code to understand the differences, it might be better just to use one of the [profilers](http://docs.nvidia.com/cuda/profiler-users-guide/index.html#abstract) to understand what is going on at a detail level, in each case. By comparing the timeline generated by the visual profiler, for example, it will likely be more obvious why there are differences. – Robert Crovella Jun 23 '15 at 21:43
  • Thanks for the advice. I'll analyze using the profiler. – The Vivandiere Jun 23 '15 at 21:46