
How can I calculate GFLOPS for this processor: Intel Xeon E5-2670 v2, clock speed 2.5 GHz, 2 vCPUs, 7.5 GiB memory, storage 1 × 32 SSD, moderate networking performance (500 Mbps)?

It's the AWS instance type m3.large. I am not able to find the IPC and calculate GFLOPS so I can estimate my cost. Any help would be great.

1 Answer


Xeon E5-xxxx v2 is an IvyBridge core, so it doesn't support FMA. See Agner Fog's microarch pdf for the details of the IvyBridge pipeline.

If you manage to avoid any memory bottlenecks, IvB can sustain a throughput of two AVX vector FP operations per clock: execution port 0 can run vmulps and execution port 1 can run vaddps, so the peak is one vector multiply plus one vector add per cycle.

So: 2.5 GHz * 2 FP vectors/clock * 8 single-precision elements/vector

Thus: single-precision 40 GFLOP/s theoretical max, using AVX 256b vectors; double-precision 20 GFLOP/s (4 DP elements per 256b vector).
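If it helps to see the arithmetic spelled out, here is a minimal sketch of that peak calculation in C. The numbers are just the ones quoted above (2.5 GHz, one mul + one add per clock, 256-bit vectors); nothing here measures real hardware.

```c
#include <stdio.h>

/* Back-of-the-envelope per-core peak-FLOPS arithmetic, using the
 * figures from the answer above (illustrative only, not a benchmark). */
int main(void) {
    double clock_ghz        = 2.5;  /* base clock quoted in the question        */
    double vec_ops_per_clk  = 2.0;  /* one vmulps + one vaddps issued per cycle */
    double sp_elems_per_vec = 8.0;  /* 256-bit vector / 32-bit float            */
    double dp_elems_per_vec = 4.0;  /* 256-bit vector / 64-bit double           */

    double sp_gflops = clock_ghz * vec_ops_per_clk * sp_elems_per_vec;
    double dp_gflops = clock_ghz * vec_ops_per_clk * dp_elems_per_vec;

    printf("per-core theoretical peak: %.1f SP GFLOP/s, %.1f DP GFLOP/s\n",
           sp_gflops, dp_gflops);   /* prints 40.0 and 20.0 */
    return 0;
}
```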

Note that even from L1 cache, IvB only has 128b load/store data paths, so with 256b vectors it can only sustain two loads and one store per 2 clocks.

mul has 5c latency and add has 3c latency, so you need enough instruction-level parallelism to keep at least 5 multiplies (and 3 adds) in flight at once, e.g. by using multiple independent accumulators.
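A rough sketch of what that kind of ILP looks like with AVX intrinsics (compile with -mavx): several independent accumulators so the loop-carried adds don't serialize on latency. The function name, the choice of 5 accumulators, and the assumption that n is a multiple of 40 are all illustrative, not from the answer.

```c
#include <immintrin.h>
#include <stddef.h>

/* Illustrative only: 5 independent accumulators split the loop-carried
 * dependency into 5 separate vaddps chains. With a single accumulator,
 * each add would have to wait out the previous add's 3-cycle latency,
 * capping throughput well below one vector op per clock. */
float sum_of_scaled(const float *x, size_t n, float scale)
{
    __m256 s = _mm256_set1_ps(scale);
    __m256 acc[5];
    for (int j = 0; j < 5; j++)
        acc[j] = _mm256_setzero_ps();

    /* assumes n is a multiple of 5 * 8 = 40 for simplicity */
    for (size_t i = 0; i < n; i += 5 * 8) {
        for (int j = 0; j < 5; j++) {
            __m256 v = _mm256_loadu_ps(x + i + 8 * j);           /* 8 floats     */
            acc[j] = _mm256_add_ps(acc[j], _mm256_mul_ps(v, s)); /* mul then add */
        }
    }

    /* reduce the 5 accumulators to a single scalar */
    __m256 t = _mm256_add_ps(_mm256_add_ps(acc[0], acc[1]),
                             _mm256_add_ps(acc[2], acc[3]));
    t = _mm256_add_ps(t, acc[4]);
    float tmp[8];
    _mm256_storeu_ps(tmp, t);
    float total = 0.0f;
    for (int j = 0; j < 8; j++)
        total += tmp[j];
    return total;
}
```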

Peter Cordes
  • Can you explain to me, or give me a link on, how you calculate IPC? – Anchal Khandelwal Apr 09 '16 at 03:30
  • @AnchalKhandelwal: I already did, in the first paragraph. It's highly non-trivial for real code ([see some of my other SO answers](http://stackoverflow.com/search?q=user%3Ame+latency+throughput+uops)), but easy to give the theoretical max. The vector FP mul and add units are fully pipelined, which I forgot to mention. – Peter Cordes Apr 09 '16 at 03:34
  • So to calculate GFLOPS, can vCores be used directly? What if my instance has 32 vCores? This processor has 10 cores and 20 threads. Should I be doing (2.5 * 10 * 8) * 2 processors? Core details found at: http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2670%20v2.html – Anchal Khandelwal Apr 09 '16 at 19:50
  • @AnchalKhandelwal: Like Agner Fog's microarch pdf explains, hyperthreading doesn't help *if* your code can already saturate the execution units of a core. What it can help with is code that bottlenecks on cache misses or branch mispredicts, or on latency. Unless you have very well-tuned code that can saturate a core without hyperthreading, you need to benchmark it on similar HW to see how it behaves. – Peter Cordes Apr 09 '16 at 22:03