
I'm working on a college assignment for my computer architecture class, and we have to run different benchmark tests on our personal computers to determine how different technologies affect their efficiency.

I'm using SiSoftware Sandra Lite 2021 for the benchmark tests (under Benchmark > Processor > Processor Arithmetic), and my CPU is an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz.

First, I ran the benchmark with all the options enabled (all extensions, multithreading, hyperthreading and so on). Then, I ran it again with all the extensions disabled (SSE, AVX, FMA, AES, SHA...).

Here is a table with the mean of three benchmark runs for each test:

| Benchmark used | Extensions enabled | Extensions disabled |
| --- | --- | --- |
| Dhrystone Integer Native | 113.47 GIPS (AVX2) | 83 GIPS (ALU) |
| Whetstone Single-float Native | 83.32 GFLOPS (AVX/FMA) | 88.76 GFLOPS (FPU) |
| Whetstone Double-float Native | 68.79 GFLOPS (AVX/FMA) | 82.38 GFLOPS (FPU) |

Here's the question: why do I get higher scores in the Whetstone benchmark when I disable all the extensions?

I do understand why I get a lower score in the Dhrystone test when I disable the extensions, because many of them work on the SIMD (Single Instruction, Multiple Data) principle. However, since many extensions also help the CPU do floating-point operations faster, I was expecting the same thing to happen in the Whetstone results.
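To illustrate what I mean by the extensions helping floating point, here's a small sketch (my own illustration, not code from Sandra) of the same multiply-add done one element at a time versus eight at a time with an AVX/FMA intrinsic:

```c
/* Illustrative only: scalar multiply-add vs. an 8-wide AVX/FMA intrinsic.
 * Build with something like: gcc -O2 -mavx2 -mfma demo.c -o demo          */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, s[8], v[8];

    /* Scalar path: one multiply-add per element, eight operations. */
    for (int i = 0; i < 8; i++)
        s[i] = a[i] * 2.0f + 1.0f;

    /* AVX + FMA path: all eight elements in one fused multiply-add. */
    __m256 va = _mm256_loadu_ps(a);
    __m256 vr = _mm256_fmadd_ps(va, _mm256_set1_ps(2.0f), _mm256_set1_ps(1.0f));
    _mm256_storeu_ps(v, vr);

    for (int i = 0; i < 8; i++)
        printf("%g %g\n", s[i], v[i]);
    return 0;
}
```

That's why I expected the AVX/FMA runs to win in Whetstone as well.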

Any idea why I got these results?

Thanks in advance.


I'm including the list of features my CPU supports, in case it's useful:

  • Features:
    • HTT - Hyper-Threading Technology : Yes
    • SSE2 - Streaming SIMD Extensions v2 : Yes
    • SSE3 - Streaming SIMD Extensions v3 : Yes
    • SSE4.1 - Streaming SIMD Extensions v4.1 : Yes
    • SSE4.2 - Streaming SIMD Extensions v4.2 : Yes
    • AES - Accelerated Cryptography Support : Yes
    • AVX - Advanced Vector eXtensions : Yes
    • FMA3 - Fused Multiply/Accumulate eXtensions : Yes
    • AVX2 - Advanced Vector eXtensions v2 : Yes
That doesn't make much sense, unless CPU frequency is throttling to a factor of 16 lower. (Which doesn't sound plausible). Maybe less if Whetstone isn't vectorized *efficiently*. Or if the plain benchmark is actually also vectorized. For theoretical max FLOPS on your Skylake CPU, a 256-bit FMA has the same throughput cost as a scalar multiply or add; either is a single uop for port 0 or 1. (https://www.realworldtech.com/haswell-cpu/). For single-precision, it gets 8 (SIMD) FMAs done, each counting as two FLOPs, so a 16x speedup. Of course it might bottleneck on bandwidth or latency... – Peter Cordes Sep 12 '22 at 17:45
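A rough illustration of that arithmetic (a back-of-the-envelope sketch assuming the usual Skylake figures of two FMA-capable ports and 256-bit vectors, not anything measured by Sandra):

```c
/* Back-of-the-envelope peak-FLOPS estimate for a Skylake-class chip.
 * Port counts, lane widths and clocks are assumptions about the
 * i7-6700HQ, not measurements.                                       */
#include <stdio.h>

int main(void)
{
    double ghz       = 2.6; /* base clock; turbo would raise this         */
    int    cores     = 4;   /* i7-6700HQ is a 4-core part                 */
    int    fma_ports = 2;   /* ports 0 and 1 can each issue one FMA/cycle */
    int    sp_lanes  = 8;   /* 256-bit AVX = 8 single-precision lanes     */
    int    flops_fma = 2;   /* an FMA counts as a multiply plus an add    */

    double simd_gflops   = ghz * cores * fma_ports * sp_lanes * flops_fma;
    double scalar_gflops = ghz * cores * fma_ports; /* 1 lane, 1 FLOP per op */

    printf("theoretical SP peak, 256-bit FMA: %.1f GFLOPS\n", simd_gflops);
    printf("theoretical SP peak, scalar     : %.1f GFLOPS\n", scalar_gflops);
    printf("ratio: %.0fx\n", simd_gflops / scalar_gflops);
    return 0;
}
```

Which is where the 16x figure comes from; a real run only gets close to it if the code keeps both FMA ports fed every cycle.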
  • 1
So it probably depends a lot on how Whetstone was optimized, and what bottlenecks it has. It's not something that's going to be true in general; `gcc -O3 -march=native -ffast-math` can give big speedups over `-O3 -fno-tree-vectorize` (SSE2 is baseline for x86-64, so I'm a bit curious what they did to disable SIMD usage, if anything; possibly you're competing against SSE2 SIMD.) – Peter Cordes Sep 12 '22 at 17:47
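For example, a minimal stand-in kernel (hypothetical, not the actual Whetstone source) could be built both ways with the flags from the comment above and timed:

```c
/* Minimal FP kernel to compare vectorized vs. scalar code generation.
 * Illustrative stand-in only, NOT the Whetstone benchmark.
 *
 * Vectorized build: gcc -O3 -march=native -ffast-math kernel.c -o kernel_simd
 * Scalar-ish build: gcc -O3 -fno-tree-vectorize kernel.c -o kernel_scalar
 *   (SSE2 is still the x86-64 baseline, so "scalar" here means one
 *    element per instruction, not x87.)                                    */
#include <stdio.h>

#define N (1 << 20)
static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 0.5f; c[i] = 0.25f; }

    float sum = 0.0f;
    for (int r = 0; r < 1000; r++)       /* repeat to get a measurable time */
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i] + c[i];   /* candidate for FMA + SIMD        */

    printf("%f\n", sum);                 /* keep the result live            */
    return 0;
}
```

Looking at the compiler's `-S` output (or `objdump -d`) for each build shows whether the inner loop uses `vfmadd` instructions on `ymm` registers or one scalar `ss` operation at a time.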
  • 1
I advise you to profile the program (e.g. with VTune, since you are likely on Windows). You can see the asm code of the hotspot loops (so you can check whether SSE is actually used or not) and the instructions that are the bottleneck quite easily (so you can see if the code is properly optimized). Besides, please check that you are using a high-performance governor (a benchmark running under a power-save governor can explain such weird results). – Jérôme Richard Sep 13 '22 at 19:34
