
I know of many examples where a GPU is much faster than a CPU. But there are algorithms (problems) which are very hard to parallelise. Could you give me some examples or tests where a CPU can beat a GPU?

Edit:

Thanks for the suggestions! We can make a comparison between the most popular and the newest CPUs and GPUs, for example a Core i5 2500K vs a GeForce GTX 560 Ti.

I wonder how to compare the SIMD models of the two. For example, CUDA calls its SIMD model, more precisely, SIMT. But SIMT should be compared with the multithreading on CPUs, which distributes threads (tasks) between MIMD cores (a Core i5 2500K gives us 4 MIMD cores). On the other hand, each of these MIMD cores can implement the SIMD model, but this is something other than SIMT and I don't know how to compare them. Finally, the Fermi architecture with concurrent kernel execution might be considered as MIMD cores with SIMT.

tynk
  • Please read [this blog post](http://blog.stackoverflow.com/2011/08/gorilla-vs-shark/) before asking any more questions like this on Stack Overflow. – talonmies Aug 17 '11 at 11:26
  • @talonmies: I disagree: this is a very specific question with a very specific answer. The only point against it is that it's *probably* a duplicate of some other question. – Joachim Sauer Aug 17 '11 at 11:36
  • possible duplicate of [GPGPU vs. Multicore?](http://stackoverflow.com/questions/5919172/gpgpu-vs-multicore) – Joachim Sauer Aug 17 '11 at 11:38
    See "Bad candidates" in - http://www.oxford-man.ox.ac.uk/gpuss/simd.html – Flexo Aug 17 '11 at 11:39
  • I think many people would like to know the answer to this question. But I believe the question could be clearer, at least about which perspective, or which kind of GPUs you want to compare with CPUs. It is very easy to construct an OpenCL program that runs faster on the CPU: one that has only one thread - you will find no way to fully utilize GPU resources. However, if you want to talk solely about the parallel benchmarks out there: before Fermi, I saw some that were faster on the CPU. But with Fermi, all I've tested are faster on the GPU (compared to a Xeon 5660). – Zk1001 Aug 17 '11 at 11:52
  • I think this is mostly a matter of **"when"**. Given CPU and GPU implementations of a certain algorithm, **when** will the CPU version be faster? The CPU will be faster when the data being processed is small enough to be processed on the CPU in less time than it would take the GPU version to transfer the data to the GPU and process it. So algorithms that are usually faster on the GPU might perform better on the **CPU** in some cases, namely when the input data is very small, because of the overhead of sending the data to the GPU. – karlphillip Aug 17 '11 at 13:57
  • CPUs and GPUs each have their strengths and weaknesses, and hence both are indispensable. For programs with serial execution, a high amount of control flow and high data reuse, CPUs can be better due to stronger individual cores, large caches and techniques such as branch prediction. Also, GPUs and CPUs are optimized for high throughput and low latency, respectively, so "better" also depends on the metric you choose. See my [survey paper](https://www.academia.edu/12355899/A_Survey_of_CPU-GPU_Heterogeneous_Computing_Techniques) for more insights. – user984260 Nov 27 '17 at 02:45

1 Answer


Based on my experience, I will summarize the key performance differences between parallel programs on CPUs and GPUs. Trust me, the comparison can change from generation to generation. So I will just point out what is good and bad for CPUs and GPUs. Of course, if you write a program at one extreme, i.e., one that has only the good or only the bad sides, it will definitely run faster on one platform. But a mixture of the two requires very complicated reasoning.

Host program level

One key difference is memory transfer cost. GPU devices require memory transfers between host and device. This cost is non-trivial in some cases, for example when you have to frequently transfer big arrays. In my experience, this cost can be minimized by pushing most of the host code into device code. The only cases where you cannot do so are when you have to interact with the host operating system from within the program, such as writing output to the monitor.
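To make the transfer cost concrete, here is a minimal sketch (the array size and the trivial `scale` kernel are my own illustrative assumptions) that uses CUDA events to time the host-to-device copy, the kernel, and the copy back separately. For a kernel that does this little work per byte, the two copies usually dominate, which is exactly the overhead you avoid by keeping data on the device.

```cuda
// Minimal sketch: time the host<->device copies separately from the kernel,
// to show how transfer cost can dominate when the kernel itself is cheap.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                       // very little work per byte moved
}

int main()
{
    const int n = 1 << 24;                      // ~16M floats (~64 MB), illustrative
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // host -> device
    cudaEventRecord(t1);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);       // the actual work
    cudaEventRecord(t2);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float h2d, kern, d2h;
    cudaEventElapsedTime(&h2d, t0, t1);
    cudaEventElapsedTime(&kern, t1, t2);
    cudaEventElapsedTime(&d2h, t2, t3);
    printf("H2D %.2f ms, kernel %.2f ms, D2H %.2f ms\n", h2d, kern, d2h);

    cudaFree(d);
    free(h);
    return 0;
}
```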

Device program level

Now we come to a complex picture that hasn't been fully revealed yet. What I mean is that there are many details of GPUs that haven't been disclosed. But still, there is a lot that distinguishes CPU and GPU (kernel) code in terms of performance.

There are a few factors that I have noticed contribute dramatically to the difference.

  • Workload distribution

GPUs, which consist of many execution units, are designed to handle massively parallel programs. If you have little work, say a few sequential tasks, and put these tasks on a GPU, only a few of those many execution units are busy, so it will be slower than the CPU. CPUs, on the other hand, are better at handling short and sequential tasks. The reason is simple: CPUs are much more complicated and able to exploit instruction-level parallelism, whereas GPUs exploit thread-level parallelism. Well, I heard NVIDIA's GF104 can do superscalar execution, but I have had no chance to experiment with it.

It is worth noting that, on GPUs, the workload is divided into small blocks (or workgroups in OpenCL), and blocks are arranged in chunks, each of which is executed on one streaming multiprocessor (I am using NVIDIA terminology). On CPUs, however, those blocks are executed sequentially - I can't think of anything other than a single loop.

Thus, programs that have a small number of blocks are likely to run faster on CPUs.
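As a sketch of that point (the size of 256 elements and the `add_one` kernel are illustrative assumptions, not taken from the question): with such a tiny input the grid is a single block, so one streaming multiprocessor works while the rest of the GPU sits idle, and the launch and transfer overhead alone can exceed the plain CPU loop.

```cuda
// Sketch: the same tiny job written for GPU and CPU. With only 256 elements
// the grid is one block, so a single SM works while the rest of the GPU idles.
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void add_one_cpu(float *x, int n)                // the competing CPU version
{
    for (int i = 0; i < n; ++i) x[i] += 1.0f;
}

int main()
{
    const int n = 256;                           // tiny workload, illustrative
    float host[256] = {0.0f};

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(dev, n);   // a grid of exactly one block
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    add_one_cpu(host, n);                        // finishes almost instantly
    return 0;
}
```

With n in the millions instead, the grid becomes tens of thousands of blocks and the picture flips in the GPU's favour.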

  • Control flow instructions

Branches are always bad for GPUs. Please bear in mind that GPUs prefer uniformity: equal blocks, equal threads within a block, and equal threads within a warp. But what matters the most?

***Branch divergence.***

CUDA/OpenCL programmers hate branch divergence. Since all the threads are divided into sets of 32 threads, called warps, and all threads within a warp execute in lockstep, branch divergence causes some threads in the warp to be serialized. Thus, the execution time of the warp is multiplied accordingly.

Unlike on GPUs, each core in a CPU can follow its own path. Furthermore, branches can be executed efficiently because CPUs have branch prediction.

Thus, programs that have more warp divergence are likely to run faster on CPUs.
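To illustrate what divergence looks like in kernel code (both kernels and their predicates are made-up examples, shown without a host driver): in the first kernel, odd and even lanes of the same warp disagree on the branch, so the warp executes both paths one after the other; in the second, the condition is uniform across each warp, so there is still a branch but no divergence.

```cuda
// Sketch: two kernels doing the same amount of work with different divergence.
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)          // odd/even lanes of one warp disagree:
        x[i] = x[i] * 2.0f;            // the warp runs this path...
    else
        x[i] = x[i] + 2.0f;            // ...and then this one, serially
}

__global__ void uniform(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // whole warps agree on the condition:
        x[i] = x[i] * 2.0f;            // each warp takes exactly one path,
    else
        x[i] = x[i] + 2.0f;            // so nothing is serialized inside a warp
}
```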

  • Memory access instructions

This is REALLY complicated, so let's keep it brief.

Remember that global memory accesses have very high latency (400-800 cycles). So in older generations of GPUs, whether memory accesses were coalesced was a critical matter. Now your GTX 560 (Fermi) has two levels of cache, so the cost of global memory accesses can be reduced in many cases. However, caches in CPUs and GPUs are different, so their effects are also different.
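A small sketch of the coalescing point (both kernels and the `stride` parameter are illustrative assumptions): in the first kernel, consecutive threads read consecutive addresses, so a warp's 32 loads coalesce into a few memory transactions; in the second, a large stride scatters the warp's addresses and each load may need its own transaction.

```cuda
// Sketch: coalesced vs. strided global memory reads.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 // neighbouring threads read neighbouring
                                        // addresses: one warp -> few transactions
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];  // a warp's addresses are scattered:
                                        // up to one transaction per thread
}
```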

What I can say is that it really, really depends on your memory access pattern and your kernel code pattern (how memory accesses are interleaved with computation, the types of operations, etc.) whether something runs faster on GPUs or CPUs.

But you can expect that a huge number of cache misses (on GPUs) will have a very bad effect on GPU performance (how bad? - it depends on your code).

Additionally, shared memory is an important feature of GPUs. Accessing shared memory is as fast as accessing the GPU's L1 cache. So kernels that make use of shared memory will benefit quite a lot.
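As a sketch of how shared memory gets used (the block-wide sum and the fixed block size of 256 are my own illustrative choices): each block stages its slice of the input in `__shared__` memory once, and all the repeated accesses during the reduction then hit fast on-chip storage instead of going back to global memory.

```cuda
// Sketch: a block-wide sum that stages data in shared memory so the repeated
// accesses during the reduction hit on-chip storage, not global memory.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                  // one element per thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one global read per thread
    __syncthreads();

    // Tree reduction entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];               // one global write per block
}
// Launch with 256 threads per block, e.g. block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```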

There are some other factors I haven't really mentioned that can have a big impact on performance in many cases, such as bank conflicts, memory transaction size, and GPU occupancy...

Zk1001