6

Intel has several SIMD ISAs, such as SSE, AVX, AVX2, AVX-512 and IMCI on Xeon Phi. These ISAs are supported on different processors. For example, AVX-512 BW, AVX-512 DQ and AVX-512 VL are supported only on Skylake, not on Xeon Phi. AVX-512F and AVX-512 CDI are supported on both Skylake and Xeon Phi, while AVX-512 ERI and AVX-512 PFI are Xeon Phi only.

Why doesn't Intel design a more universal SIMD ISA that can run on all of its advanced processors?

Also, Intel removes some intrinsics and adds new ones as it develops its ISAs. Many intrinsics come in several flavours: some operate on packed 8-bit elements, others on packed 64-bit elements, and so on. Some flavours are not widely supported. For example, Xeon Phi will not be able to process packed 8-bit values, while Skylake will.

Why does Intel alter its SIMD intrinsics in such an inconsistent way?

If the SIMD ISAs were more compatible with each other, existing AVX code could be ported to AVX-512 with much less effort.

thierry
  • `For example, AVX-512 BW, AVX-512 DQ and AVX-512 VL are only supported on Skylake`. That is wrong: Skylake does not support AVX-512. – Amiri Mar 11 '17 at 08:22
  • 1
    @Martin: confusingly AVX-512 is available on Skylake Xeon (aka Skylake X, aka Purley) but not on the original Skylake consumer CPUs. – Paul R Nov 04 '17 at 07:06
  • @PaulR Yeah, SKL Xeon does, and I have even found some Core i7 and Core i9 parts that support AVX-512. But my SKL doesn't (Core i7-6700HQ). – Amiri Nov 04 '17 at 20:32

2 Answers

10

I see three reasons for this.

(1) When they originally designed MMX they had very little die area to work with, so they made it as simple as possible. They also designed it to be fully compatible with the existing x86 ISA (precise interrupts plus some state saving on context switches). They hadn't anticipated that they would keep enlarging the SIMD register widths and adding so many instructions. Every generation, when they added wider SIMD registers and more sophisticated instructions, they had to maintain the old ISA for compatibility.

(2) The weird thing you're seeing with AVX-512 comes from the fact that they are trying to unify two disparate product lines. Skylake comes from Intel's PC/server line, so its path can be seen as MMX -> SSE/2/3/4 -> AVX -> AVX2 -> AVX-512. The Xeon Phi was based on an x86-compatible graphics card called Larrabee that used the LRBni instruction set. This is more or less the same as AVX-512, but with fewer instructions and not officially compatible with MMX/SSE/AVX/etc.

(3) They have different products for different demographics. For example, (as far as I know) the AVX-512 CD instructions won't be available in the regular SkyLake processors for PCs, just in the SkyLake Xeon processors used for servers and in the Xeon Phi used for HPC. I can understand this to an extent, since the CD extensions are targeted at things like parallel histogram generation; that is more likely to be a critical hotspot in servers/HPC than in general-purpose PCs.

I do agree it's a bit of a mess. Intel is beginning to see the light and is planning better for additional expansions; AVX-512 is supposedly ready to scale to 1024 bits in a future generation. Unfortunately it's still not really good enough, and Agner Fog discusses this on the Intel forums.

Personally, I would have liked to see a model that can be upgraded without the user having to recompile their code each time. For example, instead of defining the AVX register as 512 bits in the ISA, this should be a parameter of the microarchitecture, retrievable by the programmer at runtime. The user asks "what is the maximum SIMD width available on this machine?", the architecture returns XYZ, and the user has generic control flow to cope with whatever that XYZ is. This would be much cleaner and more scalable than the current technique, which needs several versions of the same function for every possible SIMD width. :-/

hayesti
  • 2
    [SkyLake PC processors won't even have AVX512](http://www.kitguru.net/components/cpu/anton-shilov/intel-skylake-processors-for-pcs-will-not-support-avx-512-instructions/) – Z boson Jul 13 '15 at 12:07
  • @Zboson Buff, that's not good. I suppose they need a way to differentiate their Xeon line more. – hayesti Jul 13 '15 at 13:54
  • 2
    Maybe they want to give AMD a chance to catch up a bit. – Z boson Jul 13 '15 at 13:56
  • The suggestion in your last paragraph isn't as feasible as you think it is. Even *with* recompilation, efficient use of SIMD is still pretty poor with modern compilers. Moving this to run-time is asking for a lot more. You're basically asking to JIT, or to go all-out GPU-style programming. And there's a reason GPGPUs haven't killed off CPUs yet. – Mysticial Jul 13 '15 at 14:04
  • @Mysticial Well, I was thinking more in line with older vector architectures. Programmers of these machines used to strip mine loops in this way. – hayesti Jul 13 '15 at 14:26
  • 1
    @hayesti Did those architectures have out-of-order super-scalar execution? So multiple instructions could be in flight at once, if they didn't depend on each other. Modern CPUs need to do this to be fast on non-vector code (like running your web browser), so that constrains how they implement SIMD. The current model fits well into the pipeline, but does require different binaries for different CPUs. And unfortunately, other than auto-vectorization, different source, too, but that's a different problem that can and should be solved. – Peter Cordes Jul 14 '15 at 05:31
  • 3
    @PeterCordes Many of them have complex pipelines, e.g. the Alpha EV8, Cray BlackWidow and the NEX SX-ACE CPU. I know that one of the designers of the Alpha vector extension and Intel Larrabee (Roger Espasa) still argues that adopting vector-like support is the only sustainable way of growing SIMD extensions. He has been arguing for vector extensions in lieu of multimedia extensions [for quite some time](http://goo.gl/sz6xR8). Of course a vector approach would inherit its own set of problems but if Intel had more contingency in their ISA additions they could mitigate many of them. – hayesti Jul 14 '15 at 09:05
  • Interesting paper. One problem with fitting vector instructions into the current x86 model is how to handle a page-fault during a long-running vector instruction. For example, AVX2 `VPGATHERDD` has a mask register of which indices to actually load, and the set elements are zeroed out as the gather happens. So if there's a page fault, the mask register can be only partially cleared, and there are rules about the apparent program-order of operations (so you can tell which index faulted, because it's the leftmost one with the mask still set, or something). – Peter Cordes Jul 14 '15 at 09:33
  • Anyway, maybe `VPGATHERDD` is long enough to have the kind of overhead a vector instruction would have, but short enough not to really benefit, and isn't a good example of vector instructions being a bad fit for Intel's current decode-to-uops implementation. – Peter Cordes Jul 14 '15 at 09:36
  • 1
    Ok, reading more of the text, they say "one promising direction ... out of order to hide memory latencies -> short vectors are fine -> smaller vector register file. Each reg can be the size of a cache line." So they've just described a modern x86 CPU. AVX-512 will widen vectors to 64B (a cache line), and redouble the number of registers (to 32, in 64bit mode). Intel/AMD need to include all the transistors for superscalar to make non-vector code fast, and the current short-vector design is a good fit for that, according to that paper. It was written when there was just MMX, not FP SSE. – Peter Cordes Jul 14 '15 at 09:53
  • @PeterCordes In general I agree with you that it's not trivial to add true vector support to an x86 architecture. Like you've seen in the article, the line between multimedia extensions and vector extensions is becoming blurred. My qualm isn't that the multimedia extensions aren't useful, just that they weren't planned well. Had Intel looked at vector ISAs, scaling their widths and functionality wouldn't have been such a mess. The ISA extensions could have been made more orthogonal instead of the extremely specific combinations you see in their intrinsics (a nightmare to program). – hayesti Jul 14 '15 at 11:06
  • BTW, regarding page faults with vectors you might be interested in [this article](http://goo.gl/oMR8zl). – hayesti Jul 14 '15 at 11:08
  • 1
    Yup, every time I turn around, I find out there's an instruction for what I want, but only available for bytes, not words, or vice versa. Not to mention the clunkiness of `VZEROUPPER` thanks to a lack of forward planning, and filling up the opcode space. Yeah, could have been a LOT better. Some things, like wider vectors in the future, should have been forseen. I can forgive not forseeing that a separate set of mask registers would be desired, though, so AVX512 had to go and replace VEX with EVEX. I'm not impressed with the lack of a byte-element 256b shuffle, even if it was slow. – Peter Cordes Jul 14 '15 at 13:33
  • Neat paper. The benchmarks in that paper are a lot simpler than the kinds of work modern video codecs do. x264 has to pare down the search space of possible ways of encoding a macroblock. It needs early-outs all over the place, with different behaviour for different macroblocks. I think the CODE architecture wouldn't be anywhere near as fast branching on the result of a SAD, compared to an Intel CPU. CODE looks like it wants to have an extra layer of queueing in front of the vector unit, and is really optimized for quite long vectors. Codecs need to take early-outs. cjpeg isn't like that – Peter Cordes Jul 14 '15 at 14:40
  • Point being, CODE and other real vector-style designs might be somewhat better at scientific computing, but current Intel designs are solid at that, and also at things CODE is probably not good at (i.e. things where results of vector computations affect control flow). I liked the discussion of vector exception handling. Extremely similar to what `VPGATHERDD` does. The other thing to keep in mind is that it has to be bolted to x86, and you can do some things in GP regs you can't with in vector regs, so efficient GP<->vector and mixing vector + int code is useful. – Peter Cordes Jul 14 '15 at 14:52
  • 1
    @PeterCordes The "256-byte" shuffle is coming in AVX512-VBMI. Yeah, a full two-operand byte-granularity shuffle with 64-byte vectors. If this thing has single-cycle throughput, it makes me wonder how much area they have to throw at it to hold that many 128-to-1 MUXs. I bet there's a simpler design I'm not aware of that doesn't require N^2 log(N) transistors. – Mysticial Aug 20 '15 at 17:52
  • @Mysticial: Ya, that'll open up the possibility of much bigger in-register LUTs. `VPERMI2B` / `VPERMI2W` etc. actually give you a 64B LUT, mapping indices to values from two zmm registers holding LUT entries. (`VPERMT2B` is similar but with one of the table regs as the output.) There's also a simpler `VPERMB` / `VPERMW` (to go with AVX2 `VPERMD`) which just use one reg as the table. This might be good for a GF16 (Galois) multiply like par2 uses, by slicing the input words into nibbles. I have an Altivec implementation that someone else wrote for its 256b shuffle of 2 128b regs. – Peter Cordes Aug 20 '15 at 19:02
  • @Mysticial: It'll probably have single-cycle throughput, but 3-cycle latency like current lane-crossing AVX instructions. That means you can use multi-step MUXing hardware, right, instead of one giant one. You're absolutely right that die area is a big deal for Intel, esp. since every chip has 4 cores. I think I read someone saying on the realworldtech.com forum that saving 1mm^2 in a Skylake core would be worth millions of dollars in profits for Intel. (This was in the context of them widening the FP divider so it has the same throughput for ymm as xmm, instead of halved like SnB-BDW.) – Peter Cordes Aug 21 '15 at 00:45
  • 2
    @PeterCordes Some rough latency numbers on KNL are out. It looks like single-operand permutes are 1/cycle and two-operand permutes are 0.5/cycle. KNL has two VPUs and It seems only one of them have shuffle support. And two-operand permutes are 2 uops. Can't say what this means for Skylake though. The desktop line has always had 3 VPUs. (the 3rd has all the permutes) - In any case, seeing as how difficult it is for consumers to get KNL, it might be a while before Agner can get his numbers. – Mysticial Aug 15 '16 at 17:30
  • 1
    @Mysticial Just to come back to this a year later, [ARM's Scalable Vector Extensions](https://www.hpcwire.com/2016/08/22/arm-unveils-scalable-vector-extension-hpc-hot-chips/) implement vector-length agnosticism in more or less the manner I describe above. :-) – hayesti Nov 19 '16 at 01:13
1

There is SIMD ISA convergence between Xeon and Xeon Phi, and ultimately they may become identical. But I doubt you will ever get the same SIMD ISA across the whole Intel CPU line - bear in mind that it stretches from a tiny Quark SoC to Xeon Phi. It will be a long time, possibly forever, before AVX-1024 migrates from Xeon Phi to Quark or a low-end Atom CPU.

To get better portability between different CPU families, including future ones, I advise you to use higher-level concepts than bare SIMD instructions or intrinsics. Use OpenCL, OpenMP, Cilk Plus, C++ AMP, or an auto-vectorizing compiler. Quite often they will do a good job of generating platform-specific SIMD instructions for you.

Paul Jurczak