
I wrote a short CUDA program that uses the highly-optimized CUB library to demonstrate that one core from an old, quad-core Intel Q6600 processor (all four cores together are supposedly capable of ~30 GFLOPS) can do an inclusive scan (or cumulative/prefix sum, if you prefer) on 100,000 elements faster than an Nvidia GTX 750 Ti (supposedly capable of 1306 GFLOPS of single precision). Why is this the case?
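(For concreteness, an inclusive scan of the sequence 3, 1, 4, 1, 5 produces 3, 4, 8, 9, 14: each output element is the sum of all input elements up to and including that position.)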

The source code is:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cub/cub.cuh>

#include <stdio.h>
#include <stdlib.h> // rand(), malloc(), free()
#include <time.h>
#include <algorithm>


#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

void fillArrayWithRandom(float* inputArray, int inputN)
{
    for (int i = 0; i < inputN; i++)
    {
        inputArray[i] = (float)rand() / float(RAND_MAX);
    }
}

void inclusiveSum_CPU(float *inputArray, float *inputSummedArray, int inputN)
{
    for (int i = 0; i < inputN; i++)
    {
        if (i > 0)
        {
            inputSummedArray[i] = inputSummedArray[i - 1] + inputArray[i];
        }
        else
        {
            inputSummedArray[i] = inputArray[i];
        }
    }
}

int main()
{
    int N = 100000; //1 hundred thousand elements
    float numSimulations = 10000;

    //Make Host Arrays
    float* testArray_CPU = (float *)malloc(sizeof(float)*N);
    fillArrayWithRandom(testArray_CPU, N);
    float* testArrayOutput_CPU = (float *)malloc(sizeof(float)*N);

    //Make GPU Arrays
    float* testArray_GPU;
    gpuErrchk(cudaMalloc(&testArray_GPU, N*sizeof(float)));
    gpuErrchk(cudaMemcpy(testArray_GPU, testArray_CPU, N*sizeof(float), cudaMemcpyHostToDevice));
    float* testArrayOutput_GPU;
    gpuErrchk(cudaMalloc(&testArrayOutput_GPU, N*sizeof(float)));

    //Initiate the benchmark variables
    clock_t begin_CPU, end_CPU;
    float time_spent_GPU, time_spent_CPU;

    //GPU prep: with d_temp_storage == NULL, this first call only queries the required temporary storage size
    void     *d_temp_storage = NULL;
    size_t   temp_storage_bytes = 0;
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, testArray_GPU, testArrayOutput_GPU, N);
    gpuErrchk(cudaMalloc(&d_temp_storage, temp_storage_bytes));

    //GPU Timing

    cudaEvent_t start, stop;
    gpuErrchk(cudaEventCreate(&start));
    gpuErrchk(cudaEventCreate(&stop));
    gpuErrchk(cudaEventRecord(start, 0));
    for (int i = 0; i < numSimulations; i++)
    {
        cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, testArray_GPU, testArrayOutput_GPU, N);
    }
    gpuErrchk(cudaDeviceSynchronize()); //wait for all queued scans to finish before recording the stop event
    gpuErrchk(cudaEventRecord(stop, 0));
    gpuErrchk(cudaEventSynchronize(stop));
    gpuErrchk(cudaEventElapsedTime(&time_spent_GPU, start, stop));

    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        exit(-1);
    }

    time_spent_GPU = time_spent_GPU / 1000.0f; //cudaEventElapsedTime reports milliseconds; convert to seconds
    float avg_GPU = time_spent_GPU / numSimulations;
    printf("Avg. GPU Simulation Time: %.17g [sim/sec]\n", avg_GPU);

    //CPU Timing
    begin_CPU = clock();
    for (int i = 0; i < numSimulations; i++)
    {
        inclusiveSum_CPU(testArray_CPU, testArrayOutput_CPU, N);
    }
    end_CPU = clock();
    time_spent_CPU = (float)(end_CPU - begin_CPU) / CLOCKS_PER_SEC;
    float avg_CPU = time_spent_CPU / numSimulations;
    printf("Avg. CPU Simulation Time: %.17g [sim/sec]\n", avg_CPU);

    printf("GPU/CPU Timing:%.17gx \n", avg_GPU / avg_CPU);

    return 0;
}
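
One thing the program never does is copy the GPU result back to the host and compare it against the CPU result. A minimal sketch of such a check (my addition, not part of the original program) could go just before the final return 0;, reusing the array names above; the buffer name testArrayOutput_fromGPU and the tolerance values are arbitrary, and fabsf needs #include <math.h>:

    //Verify the last GPU scan against the CPU scan (a tolerance is needed: float summation order differs)
    float* testArrayOutput_fromGPU = (float *)malloc(sizeof(float)*N);
    gpuErrchk(cudaMemcpy(testArrayOutput_fromGPU, testArrayOutput_GPU, N*sizeof(float), cudaMemcpyDeviceToHost));
    int mismatches = 0;
    for (int i = 0; i < N; i++)
    {
        float diff = fabsf(testArrayOutput_fromGPU[i] - testArrayOutput_CPU[i]);
        if (diff > 1e-3f * fabsf(testArrayOutput_CPU[i]) + 1e-5f)
        {
            mismatches++;
        }
    }
    printf("Mismatched elements: %d of %d\n", mismatches, N);
    free(testArrayOutput_fromGPU);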

And the output when I run it on my machine is:

Avg. GPU Simulation Time: 0.0011999999405816197 [sec/sim]

Avg. CPU Simulation Time: 0.00059999997029080987 [sec/sim]

GPU/CPU Timing:2x

Also, here are my compiling flags and output:

1>------ Build started: Project: speedTest, Configuration: Debug Win32 ------
1>  Compiling CUDA source file kernel.cu...
1>  
1>  C:\Users\Owner\Documents\Visual Studio 2013\Projects\speedTest\speedTest>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc.exe" -gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin" -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include"  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -g   -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd  " -o Debug\kernel.cu.obj "C:\Users\Owner\Documents\Visual Studio 2013\Projects\speedTest\speedTest\kernel.cu" 
1>  kernel.cu
1>  
1>  C:\Users\Owner\Documents\Visual Studio 2013\Projects\speedTest\speedTest>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc.exe" -dlink -o Debug\speedTest.device-link.obj -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd  " -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\lib\Win32" cudart.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib  -gencode=arch=compute_50,code=sm_50 -G --machine 32 Debug\kernel.cu.obj 
1>  cudart.lib
1>  kernel32.lib
1>  user32.lib
1>  gdi32.lib
1>  winspool.lib
1>  comdlg32.lib
1>  advapi32.lib
1>  shell32.lib
1>  ole32.lib
1>  oleaut32.lib
1>  uuid.lib
1>  odbc32.lib
1>  odbccp32.lib
1>  kernel.cu.obj
1>  speedTest.vcxproj -> C:\Users\Owner\Documents\Visual Studio 2013\Projects\speedTest\Debug\speedTest.exe
1>  copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\cudart*.dll" "C:\Users\Owner\Documents\Visual Studio 2013\Projects\speedTest\Debug\"
1>  C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\cudart32_65.dll
1>  C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\cudart64_65.dll
1>          2 file(s) copied.
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
  • Hello hostile stranger! It was actually always 100,000 - I had just messed up the comment in the initial link I posted, but I fixed it a few minutes after. And yes, the exact 2x speed-up was surprising, but that indeed was how the cookie crumbled. I feel like the clock() thing is slightly splitting hairs, but nonetheless I've updated the code w/CUDA's timing routines. I omitted error checking because the point of the example was simplicity, not production-quality code. And as requested, I upped the simulation count (though it's had no substantial effect) and included the code. – tantrev Dec 09 '14 at 05:57
  • The output from the new run w/the CUDA timing is: Avg. GPU Simulation Time: 0.0010889752302318811 [sec/sim] Avg. CPU Simulation Time: 0.0005685999640263617 [sec/sim] GPU/CPU Timing:1.9151870012283325x Also, what CUDA card are you using? – tantrev Dec 09 '14 at 06:01
  • Yes, my bad, 100,000 elements. The comments are actually all splitting hairs, except that I'm not able to reproduce your results, so by working through these nits, maybe some additional clue will surface explaining why your results seem to be backward. (I assume you think they might be backward, or else you probably wouldn't be posting this question.) I'm pretty convinced your code can demonstrate a substantial speedup GPU vs. CPU, just not sure why it doesn't in your setup. Error checking isn't just about production code, it's about rooting out problems. It might shed some light. – Robert Crovella Dec 09 '14 at 06:07
  • Absolutely, I appreciate the help. I just added error-checking on all the routines I know how (I'm not quite sure how one would error check the CUB routine) as well as checking for any CUDA errors and couldn't find anything. Perhaps your speed-up is coming from using a very powerful GPU and/or a less powerful CPU (though my Q6600 really isn't that advanced, it might be an AMD vs Intel thing?). – tantrev Dec 09 '14 at 06:19
  • I tested on a Quadro5000. It might be about 1.5x faster than the GTX750Ti (based on memory bandwidth specs), but it should still be indicative of what is possible, roughly speaking. And my CPU was a Xeon W570, so it should probably be faster than your Q6600. – Robert Crovella Dec 09 '14 at 06:20
  • Interesting, that is odd indeed. Maybe it might be a compiler option? I just posted my current flags and compiler output from Visual Studio 2013. – tantrev Dec 09 '14 at 06:25
  • Yes, that's it. Don't build the debug configuration. Build the release configuration. The debug switch for device code (`-G`) inhibits most compiler optimizations, and this can make a substantial difference in execution speed. As an additional comment, windows may be having a slight effect here. If you want to compare perf, you can mitigate that effect by reducing the iteration count back down to 100, but increasing the elements from 100,000 to 1,000,000. This will reduce the kernel calls but increase the work per kernel, to amortize the windows overhead. But the `-G` switch is the main thing. – Robert Crovella Dec 09 '14 at 06:29
  • Fantastic, thank you! I'm playing around with Visual Studio to try to disable the flag and will let you know my results. Thanks again. – tantrev Dec 09 '14 at 06:46
  • I wouldn't bother trying to go after the flag directly. Just build the release configuration instead of the debug configuration. This is a pull-down box in the VS toolbar. That will remove the -G switch automatically. – Robert Crovella Dec 09 '14 at 06:48
  • I just discovered that right as you told me! Ha, I just got a 20x speed-up from the GPU. Problem solved! Thank you. I'll update the question. I love how I made this very simple problem into an unnecessarily complex one. – tantrev Dec 09 '14 at 06:50
  • Somebody add an answer to this farce. Either that or delete it. – talonmies Dec 09 '14 at 06:56
  • You always make me laugh @talonmies I'm glad you haven't "left the building". – Robert Crovella Dec 09 '14 at 07:04
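
One of the comments above asks how the CUB routine itself could be error-checked: cub::DeviceScan::InclusiveSum returns a cudaError_t, so its return value can be passed through the same gpuErrchk macro already used in the program. A minimal sketch of the two calls from the program with their return values checked:

    //Both the size query and the actual scan return cudaError_t, so both can be checked
    gpuErrchk(cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, testArray_GPU, testArrayOutput_GPU, N));
    gpuErrchk(cudaMalloc(&d_temp_storage, temp_storage_bytes));
    gpuErrchk(cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, testArray_GPU, testArrayOutput_GPU, N));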

1 Answer


Thanks to Robert Crovella, it turns out I was building the notoriously slow "Debug" configuration instead of the "Release" configuration. Rebuilding in Release mode (which drops the device-side `-G` debugging switch) gave the GPU about a 20x speed-up.
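
For anyone reproducing this, here is a minimal sketch of a release-style command-line compile (assuming the same CUDA 6.5 toolkit and sm_50 target; "C:\path\to\cub" is a placeholder for wherever CUB is unpacked). The key difference from the Debug build line quoted in the question is the absence of `-G`, which leaves device-code optimization enabled, plus host optimization via `-O2`:

nvcc -O2 -gencode=arch=compute_50,code=sm_50 -I"C:\path\to\cub" -o speedTest.exe kernel.cu

Inside Visual Studio, as noted in the comments, it is simpler to switch the configuration drop-down from Debug to Release, which removes the `-G` switch automatically.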

  • It's kind of interesting that 'Debug mode' slows the GPU execution down more than it slows the CPU execution. I don't quite remember NVIDIA's docs saying that anywhere. – einpoklum Apr 20 '15 at 07:30