Xeon E5-xxxx v2 uses IvyBridge (IvB) cores, so it doesn't support FMA (that arrived with Haswell). See Agner Fog's microarchitecture PDF for the details of the IvyBridge pipeline.
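If you manage to avoid any memory bottlenecks, IvB can sustain a throughput of two AVX vector FP operations per clock, but only as one multiply plus one add: execution port 0 runs vmulps and execution port 1 runs vaddps. Neither port can do both, so a loop that is all multiplies or all adds tops out at one vector per clock.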
So: 2.5 G clocks/sec * 2 FP vectors/clock * 8 single-precision elements/vector = 40 GFLOP/s per core.
Thus: the single-precision theoretical max is 40 GFLOP/s per core, using 256b AVX vectors. Double precision is half that: 20 GFLOP/s (4 DP elements per 256b vector).
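The same arithmetic as a small helper, in case you want to plug in a different clock speed or element size. This is only a sketch of the calculation above; the function name `peak_gflops` and its parameters are made up for illustration, and the result is a per-core ceiling, not a measurement.

```c
#include <assert.h>

/* Theoretical peak FLOP rate of one core, in GFLOP/s.
 *   clock_ghz        core clock in GHz (2.5 in the calculation above)
 *   vec_ops_per_clk  vector FP instructions per clock (2 on IvB)
 *   vector_bits      SIMD register width in bits (256 for AVX)
 *   element_bits     32 for float, 64 for double
 * All names here are hypothetical, chosen for this example. */
double peak_gflops(double clock_ghz, int vec_ops_per_clk,
                   int vector_bits, int element_bits)
{
    assert(clock_ghz > 0.0 && element_bits > 0);
    assert(vector_bits % element_bits == 0);

    int lanes = vector_bits / element_bits;   /* elements per vector */
    return clock_ghz * vec_ops_per_clk * lanes;
}
```

`peak_gflops(2.5, 2, 256, 32)` gives 40 and `peak_gflops(2.5, 2, 256, 64)` gives 20, matching the numbers above.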
Note that even from L1 cache, IvB only has 128b load/store data paths, so with 256b vectors it can sustain at most 2 loads and one store per 2 clocks.
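To see how much that costs, here is a rough bound on what fraction of the FP peak a loop body can reach when every operand comes from L1. It is a simplified model built only from the limits quoted above (two vector FP instructions per clock, 256b loads at one per clock, 256b stores at one per 2 clocks), and it assumes loads and stores overlap perfectly. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Fraction of peak FP throughput reachable with all data in L1.
 * Assumed model (from the limits quoted above):
 *   - 2 vector FP instructions per clock
 *   - 256b loads:  2 per 2 clocks -> 1 clock each
 *   - 256b stores: 1 per 2 clocks -> 2 clocks each
 * Counts are per loop iteration, in 256b vector operations. */
double l1_limited_fraction(int loads256, int stores256, int fp_ops)
{
    assert(loads256 >= 0 && stores256 >= 0 && fp_ops > 0);

    double fp_clocks    = fp_ops / 2.0;
    double load_clocks  = loads256 * 1.0;
    double store_clocks = stores256 * 2.0;

    /* The slower of the load and store streams limits memory throughput. */
    double mem_clocks = load_clocks > store_clocks ? load_clocks : store_clocks;
    /* The iteration takes as long as its slowest resource. */
    double clocks = fp_clocks > mem_clocks ? fp_clocks : mem_clocks;

    return fp_clocks / clocks;
}
```

For something like `y[i] = a*x[i] + y[i]` (2 loads, 1 store, 1 mul and 1 add per vector), `l1_limited_fraction(2, 1, 2)` returns 0.5: even with everything hot in L1, that loop can't get past about half of the FP peak.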
mul has 5c latency and add has 3c latency, each with a throughput of one per clock, so you need enough instruction-level parallelism to keep 5 independent multiplies and 3 independent adds in flight at once.
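The usual way to get that parallelism in a reduction is to use several independent accumulators, so that consecutive operations don't all wait on the same result. Below is a minimal scalar sketch for a dot product; the function name and `NUM_ACC` are made up for this example. In a dot product the multiplies are already independent of each other, and the loop-carried dependency is the add, so with 3c add latency at one add per clock you want at least 3 chains (4 is used here as an assumed, convenient value). A chain of dependent multiplies would need at least 5. In real AVX code each accumulator would be a 256b vector register rather than a scalar.

```c
#include <stddef.h>

#define NUM_ACC 4   /* assumed value: >= 3 to cover the 3c add latency */

/* Dot product using NUM_ACC independent accumulator chains.
 * Each acc[k] depends only on its own previous value, so the adds
 * from different chains can overlap in the pipeline. */
float dot_multi_acc(const float *a, const float *b, size_t n)
{
    float acc[NUM_ACC] = {0.0f};
    size_t i = 0;

    /* Main loop: one update per accumulator per iteration. */
    for (; i + NUM_ACC <= n; i += NUM_ACC) {
        for (size_t k = 0; k < NUM_ACC; k++)
            acc[k] += a[i + k] * b[i + k];
    }

    /* Combine the partial sums once, outside the hot loop. */
    float sum = 0.0f;
    for (size_t k = 0; k < NUM_ACC; k++)
        sum += acc[k];

    /* Leftover elements when n is not a multiple of NUM_ACC. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Note that splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ in the last bits from a single-accumulator loop. That reordering is also why compilers won't do this for you without something like -ffast-math.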