38

What are the key practical differences between GPGPU and regular multicore/multithreaded CPU programming, from the programmer's perspective? Specifically:

  • What types of problems are better suited to regular multicore and what types are better suited to GPGPU?

  • What are the key differences in programming model?

  • What are the key underlying hardware differences that necessitate any differences in programming model?

  • Which one is typically easier to use and by how much?

  • Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?

  • If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?

dsimcha
  • 67,514
  • 53
  • 213
  • 334
  • 1
    GPUs are only spectacularly efficient over CPUs when you have a highly parallel and distributed workload. – helloworld922 May 07 '11 at 05:14
  • 1
    See this [related question on SuperUser](http://superuser.com/questions/308771/why-are-we-still-using-cpus-instead-of-gpus) and my [survey paper](http://goo.gl/hBK9nw) for more details. – user984260 Jul 18 '15 at 19:44

3 Answers

42

Interesting question. I have researched this very problem so my answer is based on some references and personal experiences.

What types of problems are better suited to regular multicore and what types are better suited to GPGPU?

As @Jared mentioned, GPGPUs are built for very regular throughput workloads, e.g., graphics, dense matrix-matrix multiplication, simple Photoshop filters, etc. They are good at tolerating long latencies because they are inherently designed to tolerate texture sampling, a 1000+ cycle operation. GPU cores have a lot of threads: when one thread fires a long-latency operation (say a memory access), that thread is put to sleep (and the other threads continue to work) until the long-latency operation finishes. This allows GPUs to keep their execution units busy far more than traditional cores can.

GPUs are bad at handling branches because GPUs like to batch "threads" (SIMD lanes if you are not Nvidia) into warps and send them down the pipeline together to save on instruction fetch/decode power. If threads encounter a branch, they may diverge, e.g., 2 threads in an 8-thread warp may take the branch while the other 6 may not. Now the warp has to be split into two warps of sizes 2 and 6. If your core has 8 SIMD lanes (which is why the original warp packed 8 threads), the two newly formed warps will run inefficiently: the 2-thread warp at 25% efficiency and the 6-thread warp at 75% efficiency. You can imagine that if a GPU keeps encountering nested branches, its efficiency becomes very low. Therefore, GPUs aren't good at handling branches, and hence branchy code should not be run on GPUs.
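
As a concrete illustration (my own minimal sketch, not code from the answer), here is a hypothetical CUDA kernel in which lanes of the same warp take different paths. The hardware runs both paths one after the other with the inactive lanes masked off, so the warp operates well below full efficiency even though each thread only does one branch's work:

```cuda
// Hypothetical kernel: every 8th thread takes the expensive path, the rest
// take the cheap one. Within a warp, both paths are executed serially with
// non-participating lanes masked off (branch divergence).
__global__ void divergent(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (i % 8 == 0) {
        float x = in[i];
        for (int k = 0; k < 100; ++k)      // "long" path
            x = x * 1.000001f + 0.5f;
        out[i] = x;
    } else {
        out[i] = in[i] + 1.0f;             // "short" path
    }
}
```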

GPUs are also bad at cooperative threading. If threads need to talk to each other, then GPUs won't work well, because synchronization is not well supported on GPUs (but Nvidia is working on it).

Therefore, the worst code for a GPU is code with little parallelism, or code with lots of branches or synchronization.

What are the key differences in programming model?

GPUs don't support interrupts and exceptions. To me that's the biggest difference. Other than that, CUDA is not very different from C. You can write a CUDA program where you ship code to the GPU and run it there. You access memory in CUDA a bit differently, but again that's not fundamental to our discussion.
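
To give a sense of how close it is to C, here is a minimal sketch (mine, not the answer's; the kernel name and sizes are made up) of shipping code to the GPU: the kernel body is plain C, and the `<<<blocks, threads>>>` launch syntax plus the CUDA memory call are the main additions:

```cuda
#include <cstdio>

// Plain C-style function body; __global__ marks it as code that runs on the GPU.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));    // memory visible to both CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n); // ship the code to the GPU and run it
    cudaDeviceSynchronize();                        // wait for the GPU to finish

    printf("data[0] = %f\n", data[0]);              // prints 2.000000
    cudaFree(data);
    return 0;
}
```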

What are the key underlying hardware differences that necessitate any differences in programming model?

I have mentioned them already. The biggest is the SIMD nature of GPUs, which requires code to be written in a very regular fashion with no branches and no inter-thread communication. This is part of why, e.g., CUDA restricts the number of nested branches in the code.

Which one is typically easier to use and by how much?

It depends on what you are coding and what your target is.

Easily vectorizable code: the CPU is easier to code for but gives lower performance; the GPU is slightly harder to code for but provides a big bang for the buck. For everything else, the CPU is easier and often gives better performance as well.

Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?

Task-parallelism, by definition, requires thread communication and has branches as well. The idea of tasks is that different threads do different things. GPUs are designed for lots of threads that are doing identical things. I would not build task parallelism libraries for GPUs.

If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?

Lots of problems in the world are branchy and irregular: graph search algorithms, operating systems, web browsers, and thousands of other examples. Just to add: even graphics is becoming more and more branchy and general-purpose with every generation, so GPUs will keep becoming more and more like CPUs. I am not saying they will become just like CPUs, but they will become more programmable. The right model is somewhere in between the power-inefficient CPUs and the very specialized GPUs.

Aater Suleman
  • 2,286
  • 18
  • 11
24

Even on a multi-core CPU, your units of work are going to be much larger than on a GPGPU. GPGPUs are appropriate for problems that scale very well, with each chunk of work being incredibly small. A GPGPU has much higher latency because you have to move data to the GPU's memory system before it can be accessed. However, once the data is there, your throughput, if the problem is appropriately scalable, will be much higher with a GPGPU. In my experience, the problem with GPGPU programming is the latency of getting data from normal memory to the GPGPU.
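
To make that transfer cost visible, here is a rough sketch of my own (not code from the answer) that uses standard CUDA event timing to measure the host-to-device copy separately from the kernel itself; the kernel and array size are arbitrary placeholders:

```cuda
#include <cstdio>

__global__ void touch(float *d, int n)             // trivial kernel: one read and one write per element
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int n = 1 << 24;                         // 16M floats, 64 MB
    float *h = new float[n]();
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // the PCIe transfer
    cudaEventRecord(t1);
    touch<<<(n + 255) / 256, 256>>>(d, n);                        // the actual work
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copyMs, kernelMs;
    cudaEventElapsedTime(&copyMs, t0, t1);
    cudaEventElapsedTime(&kernelMs, t1, t2);
    printf("copy: %.2f ms, kernel: %.2f ms\n", copyMs, kernelMs);  // the copy usually dominates here

    cudaFree(d);
    delete[] h;
    return 0;
}
```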

Also, GPGPUs are horrible at communicating across worker processes if the worker processes don't have a sphere of locality routing. If you're trying to communicate all the way across the GPGPU, you're going to be in a lot of pain. For this reason, standard MPI libraries are a poor fit for GPGPU programming.

Not all processors are designed like GPUs because GPUs are fantastic for high-latency, high-throughput calculations that are inherently parallel and can be broken down easily. Most of what a CPU does is not inherently parallel and does not scale to thousands or millions of simultaneous workers very efficiently. Luckily, graphics programming does, and that's why all this started in GPUs. People have increasingly been finding problems that they can make look like graphics problems, which has led to the rise of GPGPU programming. However, GPGPU programming is only really worth your time if it is appropriate to your problem domain.

Jared Harding
  • 4,942
  • 2
  • 17
  • 14
0

What types of problems are better suited to regular multicore and what types are better suited to GPGPU?

Each GPU pipeline is similar to SMT on a CPU, except that it has 8-way or 16-way threading instead of just 2-way. This creates powerful latency-hiding opportunities between the so-called "threads" or work-items. Even without instruction-level parallelism, you can reach really high occupancy per pipeline.

On the other hand, for CPU threads, especially on cores without SMT, you need optimized instruction sequences to keep the core working at full width and full depth: something like hand-optimized AVX512 instructions (or at least a really good compiler that can do the same for you, if you make the code clear enough for it), on top of optimized synchronization techniques. Also, starting a thread on a CPU is heavy work; once it starts, it had better do a lot of work. On a GPU, by contrast, you launch a million threads in a few microseconds, which is equivalent to roughly one thread launch per picosecond.
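
A rough sketch (my own illustration, with made-up kernel and function names) of what launching a million GPU threads looks like in CUDA: the whole grid is described by a single launch configuration, and the hardware schedules the blocks itself:

```cuda
__global__ void work(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 0.5f;                      // each of the ~1M threads handles one element
}

void launch_million(float *device_out)
{
    const int n = 1 << 20;                  // ~1 million work-items
    const int threadsPerBlock = 256;
    const int blocks = n / threadsPerBlock; // 4096 blocks of 256 threads

    // One call enqueues all ~1M threads; the launch itself returns in microseconds,
    // and the GPU schedules the blocks onto its compute units.
    work<<<blocks, threadsPerBlock>>>(device_out);
}
```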

What are the key differences in programming model?

All cores of a CPU share the same RAM, so you can access the same data in any (synchronized) way you like from multiple threads. But multiple GPUs cannot automagically share a variable in RAM, especially if one card is AMD and the other is Nvidia on an Intel motherboard, at least not without paging through the PCIe bridge. So for multiple GPUs you have to choose a memory model for your software: will it be a shared-distributed memory model with a centralized approach, or a pipelined dataflow between GPUs like a grid? On CPUs you can take any approach, since even multiple CPUs on the same motherboard can access the same variable in RAM, although with extra latency.
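
As an illustration of the "you have to pick a memory model" point, here is a minimal sketch of my own (it assumes all cards are CUDA-capable, unlike the mixed AMD/Nvidia case above, and the function name is made up): each device gets its own allocation and its own slice of the data, and anything shared has to be staged through host RAM explicitly:

```cuda
#include <vector>

void split_across_gpus(const float *host_data, int n)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) return;

    int chunk = n / deviceCount;
    std::vector<float*> deviceBuffers(deviceCount);

    // Each GPU gets its own buffer and its own slice of the host array;
    // nothing is shared between the cards automatically.
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&deviceBuffers[d], chunk * sizeof(float));
        cudaMemcpy(deviceBuffers[d], host_data + d * chunk,
                   chunk * sizeof(float), cudaMemcpyHostToDevice);
    }

    // ... launch a kernel per device, then copy results back through host RAM ...

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaFree(deviceBuffers[d]);
    }
}
```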

What are the key underlying hardware differences that necessitate any differences in programming model?

GPUs require a CPU to start computing a kernel; a CPU starts from the power button. So you can't have a real OS running purely on a GPU. You could simulate one, but it would be far too slow, because each "clock" signal of a virtual CPU running on a GPU would take something like 5 to 10 microseconds. You could hit ctrl+alt+del and see 16384 logical cores in your virtual OS, but it would barely be able to render the windows, because all the logic handling and messaging would be running on pipelines at 2 GHz without CPU features like out-of-order execution, branch prediction, etc. For single-threaded workloads it would be like an overclocked Pentium I or Pentium II. Still, it would be cool to have 16384 pages of a website hosted from an RTX 4090 if it could use Ethernet directly as if it were a CPU. For now, the data has to pass through RAM/CPU to process the clicks of a website's clients.

Which one is typically easier to use and by how much?

As the CPU has direct access to RAM, it is easier to optimize streaming-type workloads on the CPU; adding 1 to all elements of an array is one of those. To add 1 to the elements of an array on the GPU, you have to send the array to the GPU first, then run the kernel, then copy the results back, all in order with the proper API commands, and possibly with pipelining to hide some latency (which won't beat the CPU anyway, since PCIe bandwidth can't surpass the RAM bandwidth of the same computer).
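
Here is a minimal sketch of that round trip (my own illustration, with explicit copies and no pipelining; on the CPU the same operation would be a single loop):

```cuda
__global__ void add_one(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void add_one_on_gpu(float *host_array, int n)
{
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // 1. send the data to the GPU over PCIe
    cudaMemcpy(d, host_array, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. run the kernel
    add_one<<<(n + 255) / 256, 256>>>(d, n);

    // 3. copy the results back (this also synchronizes with the kernel)
    cudaMemcpy(host_array, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
}
```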

CUDA is easier than hand-optimizing AVX512 CPU code; maintaining OpenCL code is harder than maintaining CPU code. It depends on what you are doing.

Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?

If you have implemented a std::map for the GPU to be used inside kernels, then you have already made it parallel; there is no need for a "more parallel" version. Since GPU threads should not diverge, any such std::map would have to be block-based instead of thread-based: multiple threads would insert into or delete from the same map at the same time, instead of each thread working on its own map, which would be horribly slow with all the independent allocations.

If you meant a std::map outside the kernel but accelerated by GPUs, then why not? Even just using video memory as storage, instead of consuming RAM (if RAM is already running low), could sometimes be helpful. The GPU could even compress/decompress big chunks of data without much of a latency penalty. For example, a std::map<int, DNA> could have parallel Huffman decoders on the GPU side to get fast decompression that serves all the CPU's threads.

If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?

  • APU: there's a GPU inside it, so it gives you GPU-like abilities in a CPU package, though you still access it in much the same way as a discrete GPU.
  • Vector processors: specialized CPUs for mainframes that are expensive but can do a lot of calculations in parallel.
  • AVX512 has 16 lanes for 32-bit floating-point operations: 16 additions and 16 multiplications at once. It's a slightly smaller version of the 32 CUDA cores of a warp. Some CPUs have dual AVX512 units, so that's 64 GFLOPS per GHz per core. One Ryzen 7900 core has 48 peak flops per cycle (32 adds and 16 muls). AMD GPUs have something like 64 pipelines per compute unit (128 flops per cycle).
  • There are 1000+ core CPUs.
  • One specialized CPU has 400k cores: 400,000 plus a few thousand spare cores just in case some fail.
huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97