
I wanted to compare the speed of a single Intel CPU core with the speed of a single nVidia GPU core (i.e. a single CUDA core, a single thread). I implemented the following naive 2D image convolution algorithm:

#include <stdint.h>

void convolution_cpu(uint8_t* res, uint8_t* img, uint32_t img_width, uint32_t img_height, uint8_t* krl, uint32_t krl_width, uint32_t krl_height)
{
    int32_t center_x = krl_width  / 2;
    int32_t center_y = krl_height / 2;
    int32_t sum;
    int32_t fkx,fky;
    int32_t xx,yy;

    // Normalization factor: 1 / (sum of all kernel weights).
    float krl_sum = 0;
    for(uint32_t i = 0; i < krl_width*krl_height; ++i)
        krl_sum += krl[i];
    float nc = 1.0f/krl_sum;

    for(int32_t y = 0; y < (int32_t)img_height; ++y)
    {
        for(int32_t x = 0; x < (int32_t)img_width; ++x)
        {
            sum = 0;

            for(int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
            {
                fky = krl_height - 1 - ky; // flipped kernel row index

                for(int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
                {
                    fkx = krl_width - 1 - kx; // flipped kernel column index

                    yy = y + (ky - center_y);
                    xx = x + (kx - center_x);

                    // Skip samples that fall outside the image (zero padding at the borders).
                    if( yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width )
                    {
                        sum += img[yy*img_width+xx]*krl[fky*krl_width+fkx];
                    }
                }
            }
            res[y*img_width+x] = sum * nc;
        }
    }
}

The algorithm is the same for both the CPU and the GPU. I also made another GPU version that is almost the same as the one above; the only difference is that it transfers the img and krl arrays to shared memory before using them.
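
To give an idea of the setup, here is a simplified sketch of the plain GPU version. It is not my exact code (and the shared-memory variant additionally stages img and krl into __shared__ arrays first), but the structure is the same:

__global__ void convolution_gpu(uint8_t* res, uint8_t* img, uint32_t img_width, uint32_t img_height, uint8_t* krl, uint32_t krl_width, uint32_t krl_height)
{
    // A single thread does all the work: same nested loops as the CPU version.
    int32_t center_x = krl_width  / 2;
    int32_t center_y = krl_height / 2;

    float krl_sum = 0;
    for(uint32_t i = 0; i < krl_width*krl_height; ++i)
        krl_sum += krl[i];
    float nc = 1.0f/krl_sum;

    for(int32_t y = 0; y < (int32_t)img_height; ++y)
        for(int32_t x = 0; x < (int32_t)img_width; ++x)
        {
            int32_t sum = 0;
            for(int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
                for(int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
                {
                    int32_t yy = y + (ky - center_y);
                    int32_t xx = x + (kx - center_x);
                    if( yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width )
                        sum += img[yy*img_width+xx]*krl[(krl_height-1-ky)*krl_width+(krl_width-1-kx)];
                }
            res[y*img_width+x] = sum * nc;
        }
}

// Launched with a single thread so that only one CUDA core does the work:
// convolution_gpu<<<1, 1>>>(d_res, d_img, img_width, img_height, d_krl, krl_width, krl_height);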

I used 2 images of dimensions 52x52 each and I got the following performance:

  • CPU: 10ms
  • GPU: 1338ms
  • GPU (smem): 1165ms

The CPU is an Intel Xeon X5650 2.67GHz and the GPU is an nVidia Tesla C2070.

Why do I get such a performance difference? It looks like a single CUDA core is about 100 times slower for this particular code! Could someone explain why? The reasons I can think of are:

  1. the CPU's higher frequency
  2. the CPU does branch prediction.
  3. the CPU having better caching mechanisms, maybe?

What do you think is the major issue causing this huge performance difference?

Keep in mind that I want to compare the speed between a single CPU thread and a single GPU thread. I am not trying to evaluate the GPU's overall computing performance. I am aware that this is not the right way to do convolution on the GPU.

  • Why would it be only 5-10 times slower? You are comparing two **very** different multi-threading architectures. A GPU relies solely on SIMD (or SIMT) algorithms. Using only one thread makes absolutely no sense to evaluate the computing power of a GPU... – BenC Jun 12 '13 at 05:11
  • This "5-10 times slower" was wrong. I will remove it. I am not trying to evaluate the computing power of a GPU. Maybe I was not very clear in the first post. I am trying to understand why there is such a huge performance difference between a single CUDA core and a single CPU core. – AstrOne Jun 12 '13 at 05:20
  • 11
    Comparing 1 thread on your CPU to 1 thread on your GPU which means only 1 warp scheduler of 1 SM. The CPU core is out of order, has branch prediction, prefetch, micro-op re-ordering, 10x faster L1, 10x faster L2, ability to dispatch 6x more instructions per cycle, 4.6x faster core frequency. The Fermi architecture is not optimized for single thread performance. Increasing thread count to 32 is free if all memory operations are coalesced. Increasing warp count to 8-12/SM is also close to free due to latency hiding. – Greg Smith Jun 12 '13 at 05:31
  • Thank you for your replies BenC and Greg. So, if we assume that my GPU has only one CUDA core, I am not doing something really wrong in the code, right? It is just the fact that CPUs are way more sophisticated. – AstrOne Jun 12 '13 at 05:38
  • 1
    They both have their specificities. CPU threads are perfect for task parallelism, while GPU threads will stand out with data parallelism. It's essential to understand the underlying architecture of CPUs and GPUs to make any comparison worth anything. SMs are not able to execute instructions at a granularity finer than 32 (warp), so even if you think that you are using only one thread, there are actually 31 threads waiting, doing nothing. You wouldn't use a chainsaw to trim a bonsai tree. Well, neither would you use only one thread on a GPU. – BenC Jun 12 '13 at 05:56
  • A single GPU core can run 40 threads, and it needs 40 threads. You would want to use 40 threads on a single CUDA core, but the API doesn't have a command for that. Maybe it could get a 15x speed-up with 40 threads on a single CUDA core. Then there are warps. Then there are SMX units. Then there are multiple GPUs. – huseyin tugrul buyukisik Oct 05 '18 at 17:42

1 Answer


Let me try to explain; maybe it will work for you.

The CPU acts as the host and the GPU acts as the device.

To run a thread on the GPU, the CPU first copies everything over to the GPU: both the computation (the kernel) and the data on which the computation will be performed. For a small problem like yours, this copying time is typically greater than the computation time, because the computation itself is only a handful of instructions executed in the ALU (arithmetic and logic unit), while the copy across the bus takes much longer.
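
As a rough sketch of where those copies sit in the typical host-side flow (the buffer names here are only illustrative, not your actual code):

// Host-side flow around a kernel launch (illustrative names).
uint8_t *d_img, *d_krl, *d_res;
size_t img_bytes = img_width * img_height;   // 52*52 = 2704 bytes here
size_t krl_bytes = krl_width * krl_height;

cudaMalloc((void**)&d_img, img_bytes);
cudaMalloc((void**)&d_krl, krl_bytes);
cudaMalloc((void**)&d_res, img_bytes);

cudaMemcpy(d_img, img, img_bytes, cudaMemcpyHostToDevice);   // copy input image in
cudaMemcpy(d_krl, krl, krl_bytes, cudaMemcpyHostToDevice);   // copy kernel in

convolution_gpu<<<1, 1>>>(d_res, d_img, img_width, img_height, d_krl, krl_width, krl_height);

cudaMemcpy(res, d_res, img_bytes, cudaMemcpyDeviceToHost);   // copy result back out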

So when you run only one thread on the CPU, the CPU has all the data in its own memory and benefits from its cache, branch prediction, prefetching, micro-op re-ordering, roughly 10x faster L1 and L2, the ability to dispatch about 6x more instructions per cycle, and a 4.6x faster core frequency (as Greg Smith listed in the comments above).

But when you want to run the thread on the GPU, the data first has to be copied into GPU memory, which takes extra time. Secondly, GPU cores run a grid of threads per clock cycle, but for that we need to partition the data so that each thread gets access to one item of the array; in your example those are the img and krl arrays.

There is also a profiler available for NVIDIA GPUs. Remove code such as printouts from your program if they exist and try profiling your executable; it will show you both the copying time and the computation time in ms.
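
If you want to measure this from inside your program instead of the profiler, here is a rough sketch using CUDA events that times the copies and the kernel separately (again with illustrative names):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Time the host-to-device copies.
cudaEventRecord(start);
cudaMemcpy(d_img, img, img_bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_krl, krl, krl_bytes, cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("copy H2D: %f ms\n", ms);

// Time the kernel itself.
cudaEventRecord(start);
convolution_gpu<<<1, 1>>>(d_res, d_img, img_width, img_height, d_krl, krl_width, krl_height);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("kernel:   %f ms\n", ms);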

Loop parallelization: when you run the two loops to compute your image over img_width and img_height, it takes many clock cycles, because at the instruction level it runs through counters. But when you port them to the GPU, you use threadIdx.x and threadIdx.y, and a grid of 16 or 32 threads runs in a single clock cycle on one SM of the GPU. This means it computes 16 or 32 array items in one clock cycle, because it has more ALUs (provided there is no dependency and the data is partitioned well).
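
For example, here is a sketch of how the two image loops map onto threads, one thread per output pixel (the kernel name and launch configuration are illustrative, and nc is assumed to be precomputed on the host):

__global__ void convolution_gpu_parallel(uint8_t* res, uint8_t* img, uint32_t img_width, uint32_t img_height, uint8_t* krl, uint32_t krl_width, uint32_t krl_height, float nc)
{
    // Each thread handles exactly one output pixel: the two outer loops become the thread grid.
    int32_t x = blockIdx.x * blockDim.x + threadIdx.x;
    int32_t y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= (int32_t)img_width || y >= (int32_t)img_height) return;

    int32_t center_x = krl_width  / 2;
    int32_t center_y = krl_height / 2;
    int32_t sum = 0;

    for (int32_t ky = 0; ky < (int32_t)krl_height; ++ky)
        for (int32_t kx = 0; kx < (int32_t)krl_width; ++kx)
        {
            int32_t yy = y + (ky - center_y);
            int32_t xx = x + (kx - center_x);
            if (yy >= 0 && yy < (int32_t)img_height && xx >= 0 && xx < (int32_t)img_width)
                sum += img[yy*img_width+xx] * krl[(krl_height-1-ky)*krl_width+(krl_width-1-kx)];
        }
    res[y*img_width+x] = sum * nc;
}

// Launch one thread per pixel, e.g. 16x16 threads per block:
// dim3 block(16, 16);
// dim3 grid((img_width + 15) / 16, (img_height + 15) / 16);
// convolution_gpu_parallel<<<grid, block>>>(d_res, d_img, img_width, img_height, d_krl, krl_width, krl_height, nc);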

In your convolution algorithm you have kept the loops on the CPU side, but if you run those same loops on the GPU it will not benefit, because a single GPU thread then acts just like a single (much weaker) CPU thread, and on top of that you pay the overhead of memory caches, memory copying, data partitioning, etc.

I hope this helps you understand...
