
What are the actual differences between the Xeon W, Bronze, Silver, Gold, and Platinum series?

With earlier generations of Xeons, the E3 parts were single-socket CPUs, whereas E5s could be used in motherboards with two sockets. The E7s supported quad-socket (and probably 8-socket) configurations.

However, with the current-generation Xeons, most of the lineup has a scalability of 2S (2 processors in one motherboard).

If both Xeon Silver and Xeon Platinum can be used in a dual-socket motherboard, why would I need a Platinum processor, which is at least 5x more expensive than the Silver part, unless there are other differences?

What are the differences between the current-gen Xeon processors? I see some differences in cache size. Other than that, I couldn't find anything else.

kris

1 Answer


Gold/Platinum has more cores per socket, and/or higher base or turbo clocks. That's most of what you're paying for.

The extra UPI links that let them work in 4S or larger systems aren't relevant in a 2-socket system, but they're not the only differentiator, and presumably only a small part of the cost. With the change from an inclusive to a non-inclusive L3 cache, Skylake Xeon and later already need a snoop filter separate from the L3 tags even for single-socket operation, unlike Xeon E5, which just broadcast everything to the other socket. Presumably Xeon-SP's snoop filter can also filter snoops to the other socket, so multi-socket support didn't need to be a separate feature for 1S vs. 2S.


e.g. the top-end 2nd-gen (Cascade Lake) Intel® Xeon® Platinum 9282 Processor has 56 cores (112 threads), max turbo = 3.8 GHz, base clock = 2.6 GHz, and 77 MB of L3 cache.

The top-end Silver is Intel® Xeon® Silver 4216: 16c/32t 3.2 GHz turbo, 2.10 GHz base, 22 MB L3 cache.

Despite having almost 4x the cores, sustained and peak turbo clocks are higher on the Platinum. (With a 400 W TDP, vs. 100 W for the Silver! Less-insane Platinum chips have lower TDPs, e.g. a 32c/64t part with 2.3 GHz base / 3.7 GHz turbo is 250 W TDP.)
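A quick back-of-the-envelope comparison of the two parts quoted above, using spec-sheet numbers only (sustained clocks under real AVX-512 load would be lower):

```python
# Spec-sheet numbers for the two Cascade Lake parts quoted above.
chips = {
    "Platinum 9282": {"cores": 56, "base_ghz": 2.6, "tdp_w": 400},
    "Silver 4216":   {"cores": 16, "base_ghz": 2.1, "tdp_w": 100},
}

for name, c in chips.items():
    watts_per_core = c["tdp_w"] / c["cores"]
    core_ghz = c["cores"] * c["base_ghz"]  # crude "total compute" proxy
    print(f"{name}: {watts_per_core:.2f} W/core, {core_ghz:.1f} core-GHz")
```

Per-core power is in the same ballpark (~7.1 vs. ~6.3 W/core); what you're paying for is cramming roughly 4x the total compute into a single socket.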


Also, some (all?) Silver / Bronze CPUs have only one AVX-512 FMA execution unit, so throughput for 512-bit SIMD FP instructions is reduced; that includes all FP math, int<->FP conversions, and `_mm512_lzcnt_epi32`. Look for the "# of AVX-512 FMA Units" line on the Ark page for a specific CPU. For integer SIMD, only multiply is affected. (In hardware, SIMD integer-multiply uops run on the FMA units.) Shifts, blends, shuffles, add/sub, compare, and boolean ops all have separate vector ALUs that are 512 bits wide and don't take as much die area as multipliers.

Even that top-end Silver 4216 Cascade Lake has only one 512-bit FMA unit.

Running AVX2 code, there's zero difference. Even AVX-512 using only 256-bit vectors is fine. (`gcc -march=skylake-avx512` defaults to `-mprefer-vector-width=256` because using 512-bit vectors at all temporarily reduces max turbo. It wants to avoid the case where one unimportant 512-bit-vectorized loop gimps the clock speed for the rest of a program that spends most of its time in scalar code.)

But if you're doing heavy AVX-512 FP number crunching you probably want a CPU with 2 FMA units and to compile with 512-bit vectors.
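As a rough illustration of what 1 vs. 2 FMA units means for peak FP throughput (a sketch using the base clocks quoted above; real sustained AVX-512 clocks are workload-dependent and lower, and one FMA counts as two FLOPs):

```python
def peak_gflops(cores, ghz, fma_units, vector_bits=512, elem_bits=32):
    """Theoretical peak GFLOP/s: lanes * 2 FLOPs per FMA, per unit, per core."""
    lanes = vector_bits // elem_bits
    return cores * ghz * fma_units * lanes * 2

# Base clocks from the answer above; actual clocks under AVX-512 load are lower.
silver = peak_gflops(cores=16, ghz=2.1, fma_units=1)    # Silver 4216
platinum = peak_gflops(cores=56, ghz=2.6, fma_units=2)  # Platinum 9282
print(f"Silver 4216:   {silver:.0f} GFLOP/s FP32 peak")
print(f"Platinum 9282: {platinum:.0f} GFLOP/s FP32 peak")
```

Keep in mind these are theoretical peaks: as noted in the comments below the answer, most code bottlenecks on cache/memory bandwidth, and only carefully tuned kernels (e.g. a good BLAS matrix multiply) actually sustain 2 vector FMAs per clock.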


IDK why you tagged this Xeon Phi; that's a totally different microarchitecture.

Peter Cordes
  • I don't think integer shifts run on the FMA unit. It's true that `zmm` integer shifts only have a throughput of 1/cycle as opposed to 2 for `xmm` and `ymm` regs, but this appears to apply to all CPUs regardless of 1 or 2 FMA units, based on a check of instlat and uops.info numbers. – BeeOnRope Nov 13 '19 at 00:17
  • @BeeOnRope: ah maybe that's what I was remembering. Edited. And the p5 FMA unit apparently doesn't run `vpmuludq`. – Peter Cordes Nov 13 '19 at 00:19
  • There's `VPLZCNTD`, which apparently runs on the FMA unit (its `zmm` tput is 1 or 2 depending on whether there are 1 or 2 FMA units). – BeeOnRope Nov 13 '19 at 00:20
  • Oddly though, the results are inconsistent on whether `VPLZCNTD` causes "FMA style" downclocking. Some machines show it, some don't. – BeeOnRope Nov 13 '19 at 00:21
  • @BeeOnRope: That makes some sense; int->FP conversion needs bit-scan to normalize and calculate the exponent. [Count leading zero bits for each element in AVX2 vector](//stackoverflow.com/q/58823140) takes advantage of that for SIMD lzcnt without AVX512. – Peter Cordes Nov 13 '19 at 00:21
  • Yup, that's the theory and it also explains why `lzcnt` appeared suddenly before any of the other bitops: because it was "there anyway". – BeeOnRope Nov 13 '19 at 00:22
  • @BeeOnRope: It seems the p5 FMA unit is really only relevant for actual FP math. Integer multiply can't run on it even when present, according to Agner's testing, so in general I think it's fair to say it's only relevant for FP workloads, not integer. (Especially in a broad-strokes question like this, but I think it might be fully accurate.) – Peter Cordes Nov 13 '19 at 00:28
  • That's not what I see, e.g., [here](https://i.stack.imgur.com/Ro7Hd.png). Probably Agner just tested on a 1 FMA box, if you are talking about his SkylakeX numbers? He puts 0.5-1 for "FMA" but I assume he just edited that in based on knowing the 1 vs 2 issue, but for example vplzcnt shows as 1, so I think it's a 1 FMA box. – BeeOnRope Nov 13 '19 at 00:57
  • @BeeOnRope: hmm, interesting, thanks for the correction. https://www.agner.org/optimize/blog/read.php?i=962 doesn't say what exact model he tested. His instruction tables just list Fam 6 model 55 Stepping 4 ; I'm not sure if that tells us anything about 1 vs. 2 FMA units, and which entries he just made stuff up for :/ – Peter Cordes Nov 13 '19 at 02:32
  • Probably everything that says `0.5-1` in the throughput column was updated based on the FMA divergence info, not via measurement. uops.info doesn't make it clear either what was tested, you have to check a known FMA instruction and go from there. – BeeOnRope Nov 13 '19 at 02:49
  • Thanks for an in-depth answer. Tagged as Xeon Phi because there isn't a Xeon tag. Would you please elaborate more on the available "AVX-512 FMA Units" in processors? Technically, what I could infer is that math-operation performance would be directly proportional to the number of FMA units. If I am running some heavy math operations, these additional units should multiply the performance of the application accordingly. Am I correct? (Of course, assuming no paging is happening.) – kris Nov 13 '19 at 07:56
  • @kris2025: that doesn't mean tagging with similar-sounding but wrong tags is a good thing. OTOH, I considered removing it but decided in this case it was almost close enough to let it slide. The whole question is mostly off-topic for Stack Overflow; it's more of a serverfault or superuser question. The 1 vs. 2 FMA-unit details are performance-related and might be relevant to actual performance or programming questions; otherwise I probably would have flagged it for migration. – Peter Cordes Nov 13 '19 at 07:57
  • @Peter: Thanks for the clarification. Before posting here, I checked serverfault; however, the questions there seemed to be specific to Linux and other OS-related topics, hence I posted here. Will ensure no such thing happens down the line. Thanks a lot. Please feel free to migrate the question; I need an answer and am open to any sub-forum. – kris Nov 13 '19 at 08:00
  • @kris2025: No, a lot of code bottlenecks on cache / memory bandwidth and couldn't sustain more than 1/clock 512-bit FMA/add/mul SIMD operation. So it's not generally true that performance scales with FMA units; only carefully-tuned stuff like a good BLAS library will usually manage 2 vector FMAs per clock when doing a matrix multiply. Also of course it has to be compiled for AVX512, not just AVX2+FMA. – Peter Cordes Nov 13 '19 at 08:01