7

Now that we have GPGPUs with languages like CUDA and OpenCL, do the multimedia SIMD extensions (SSE/AVX/NEON) still serve a purpose?

I read an article recently about how SSE instructions could be used to accelerate sorting networks. I thought this was pretty neat, but when I told my comp arch professor he laughed and said that running similar code on a GPU would destroy the SIMD version. I don't doubt this, because SSE is very simple and GPUs are large, highly complex accelerators with a lot more parallelism, but it got me thinking: are there many scenarios where the multimedia SIMD extensions are more useful than using a GPU?
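For concreteness, here is a rough sketch (my own illustration, not code from the article) of the kind of compare-exchange step a SIMD sorting network is built from, assuming SSE4.1 for the packed 32-bit min/max instructions:

```cpp
#include <smmintrin.h>  // SSE4.1

// Vectorized compare-exchange: sorts four independent pairs of ints at once.
// After the call, every lane of 'lo' holds the smaller value of its pair and
// the corresponding lane of 'hi' holds the larger one. A full sorting network
// composes steps like this, with shuffles between them.
static inline void compare_exchange(__m128i& lo, __m128i& hi) {
    __m128i smaller = _mm_min_epi32(lo, hi);
    __m128i larger  = _mm_max_epi32(lo, hi);
    lo = smaller;
    hi = larger;
}
```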

If GPGPUs make SIMD redundant, why would Intel be increasing their SIMD support? SSE was 128 bits, now it's 256 bits with AVX, and next year it will be 512 bits. If GPGPUs are better at processing code with data parallelism, why is Intel pushing these SIMD extensions? They could instead put the equivalent resources (research and die area) into a larger cache and branch predictor, thus improving serial performance.

Why use SIMD instead of GPGPUs?

jonfrazen1
  • FWIW, Intel seems to have every intention of increasing SIMD size to the point where it matches (or even exceeds) GPU widths, i.e. merging the CPU and the GPU. – Mysticial Sep 02 '14 at 19:21
  • @Mysticial Ah yes? Do you have some references I could read? – jonfrazen1 Sep 02 '14 at 19:30
  • Intel's OpenCL implementation optimizes using SSE and AVX and actually provides quite decent speedups (on their CPUs, not Xeon Phi). SIMD / AVX / NEON are not going anywhere, but they will drift into the background. They will probably be doing the heavy lifting for various front ends (like OpenCL). – Pavan Yalamanchili Sep 02 '14 at 19:49
  • @Mysticial, according to Agner Fog's micro-architecture manual Intel has plans to go to 1024 bits but nothing beyond that. – Z boson Sep 02 '14 at 20:08
  • The problem isn't SIMD on CPUs. The problem is that the cores in CPUs are too fast compared to the memory. That's why GPUs and the Xeon Phi run at a much lower frequency. Things are less memory bound there; for example, O(n^2) operations are more efficient on GPUs/Xeon Phi than with fast CPUs. AMD probably has the right idea combining both. They have sorta given up on the x86 cores... – Z boson Sep 02 '14 at 20:15
  • @Zboson Most likely they're trying to see how well it will be adopted. If 512 and 1024 get a lot of usage, then they probably wouldn't hesitate to keep going. Most of what already runs well on GPUs (i.e. dense linear algebra) will almost certainly scale with arbitrarily large SIMD. But given that it's several years between each doubling of SIMD size, the point where CPU SIMD reaches GPU width is probably far enough in the future that the paradigms will have changed by then. – Mysticial Sep 02 '14 at 20:58
  • 1
    Setting up the gpgpu takes time, time which the simd version could of already finished. The gpgpu is fast once its started but the size of the workload may not be worth it. – Joshua Waring Sep 06 '14 at 06:55

2 Answers

10

Absolutely SIMD is still relevant.

First, SIMD can more easily interoperate with scalar code, because it can read and write the same memory directly, while GPUs require the data to be uploaded to GPU memory before it can be accessed. For example, it's straightforward to vectorize a function like memcmp() via SIMD, but it would be absurd to implement memcmp() by uploading the data to the GPU and running it there. The latency would be crushing.
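A rough sketch of the idea (illustrative, not a complete memcmp; it checks equality only and assumes the length is a multiple of 16), using only SSE2:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>

// Compare two buffers 16 bytes at a time. The vector compare is summarized
// into a scalar bitmask so the CPU's ordinary branch hardware can act on it;
// the data never leaves the caches the scalar code is already using.
bool equal16(const unsigned char* a, const unsigned char* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i eq = _mm_cmpeq_epi8(va, vb);      // 0xFF in every matching byte lane
        if (_mm_movemask_epi8(eq) != 0xFFFF)      // any lane differs?
            return false;                         // plain scalar branch, no round trip to a device
    }
    return true;
}
```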

Second, both SIMD and GPUs are bad at highly branchy code, but SIMD handles it somewhat less badly. This is because GPUs group multiple threads (a "warp") under a single instruction dispatcher. So what happens when threads need to take different paths: an if branch is taken in one thread, and the else branch is taken in another? This is called a "branch divergence" and it is slow: all the "if" threads execute while the "else" threads wait, and then the "else" threads execute while the "if" threads wait. CPU cores, of course, do not have this limitation.
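Inside a SIMD register the lanes have the same issue, of course: a per-lane if/else is usually handled by computing both sides and blending with a mask, which costs roughly what divergence costs on a GPU. A rough SSE2 sketch, with illustrative names:

```cpp
#include <emmintrin.h>  // SSE2

// For each 32-bit lane: result = (x > threshold) ? x * 2 : x + 1
// Both sides are evaluated for all lanes, then merged with the compare mask.
__m128i branchy_lanes(__m128i x, __m128i threshold) {
    __m128i mask      = _mm_cmpgt_epi32(x, threshold);        // all-ones where x > threshold
    __m128i if_path   = _mm_slli_epi32(x, 1);                 // x * 2
    __m128i else_path = _mm_add_epi32(x, _mm_set1_epi32(1));  // x + 1
    return _mm_or_si128(_mm_and_si128(mask, if_path),
                        _mm_andnot_si128(mask, else_path));   // (~mask) & else_path
}
```

The difference is that the surrounding control flow (loop exits, early returns, rare error paths) can stay scalar and use the CPU's branch predictor directly.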

The upshot is that SIMD is better for what might be called "intermediate workloads:" workloads up to intermediate size, with some data-parallelism, some unpredictability in access patterns, some branchiness. GPUs are better for very large workloads that have predictable execution flow and access patterns.

(There are also some peripheral reasons, such as better support for double-precision floating point on CPUs.)

ridiculous_fish
  • Thank you for those insights. Just about your point on "uploading" data to the GPU: it seems AMD's Accelerated Processing Units integrate a kind of GPGPU on the same die as the CPU cores. I'm not sure about the details, but I think they share the L2 or L3 cache. Do you think the argument applies to devices like this as well? – jonfrazen1 Sep 02 '14 at 19:27
  • 1
    You brought up branch divergence in CUDA / OpenCL and say "CPU Cores" do not have this limitation. This is an unfair statement. Firstly SIMD instructions are run per core. You can simply not have if / else statements in SIMD code. You'd have to unpack the data and do the operations separately which is probably as bad or worse compared to the branch divergence you mention. – Pavan Yalamanchili Sep 02 '14 at 19:43
  • 1
    I have been meaning to ask as similar question. I thought GPUs were basically large width SIMD devices with many slow "cores"? Isn't SIMT really a software thing and not hardware. I mean each "tread" appears to be different but it's using SIMD and each other thread in the SIMD unit has to wait for the other threads. I do something like this using `mm256_movemask_epi8` with AVX. – Z boson Sep 02 '14 at 20:13
  • Unpacking is indeed necessary for some cases, but not all, such as branching on the thread ID. Or consider the memcmp example: all that needs to be "unpacked" is a single summary bit of the register. Of course the branch itself is not a SIMD instruction, but that's because it doesn't have to be: SIMD can easily offload it to the CPU's branch machinery. GPUs don't have that luxury. – ridiculous_fish Sep 02 '14 at 20:29
  • 1
    A great example of where SIMD is still better than GPU is video encoding. The search space is so large that you need to branch based on compare results as soon as a possible way to encode a block has been ruled out. – Peter Cordes Jan 06 '18 at 00:07
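A rough sketch of the early-exit pattern from the last two comments (illustrative names, not code from either commenter), using SSE2: the vector compare is reduced to a scalar mask with movemask, and an ordinary CPU branch decides whether to stop scanning, the kind of cheap bail-out that is awkward on a GPU warp.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>

// Scan costs[] four lanes at a time and stop as soon as any lane exceeds
// 'limit'. Returns the index of the first offending block, or n if none.
// (Ignores any tail of fewer than 4 elements, to keep the sketch short.)
std::size_t first_block_over_limit(const int* costs, std::size_t n, int limit) {
    const __m128i vlimit = _mm_set1_epi32(limit);
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(costs + i));
        __m128i over = _mm_cmpgt_epi32(v, vlimit);   // lanes exceeding the limit
        if (_mm_movemask_epi8(over) != 0)            // summarize to one scalar mask
            return i;                                // bail out with a normal CPU branch
    }
    return n;
}
```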
1

GPUs have controllable dedicated caches; CPUs have better branching. Other than that, compute performance relies on SIMD width, integer core density, and instruction-level parallelism.

Another important factor is how far the data is from the CPU or GPU. (Your data could be an OpenGL buffer in a discrete GPU, and you may need to download it to RAM before computing with the CPU; the same effect occurs when a host buffer is in RAM and needs to be computed on a discrete GPU.)

huseyin tugrul buyukisik
  • Well, in the sorting example I thought SIMD could be useful if the sorted array is used by the CPU afterwards. But my professor thinks it's better to give it to the GPU and get it back sorted. His research is with GPGPU things so I suppose he has a bias, but still... I have my doubts. – jonfrazen1 Sep 02 '14 at 19:42
  • 1
    What is the length of the array to be sorted and what is complexity of sorting? – huseyin tugrul buyukisik Sep 02 '14 at 19:45