
I tested the following code on my machine to see how much floating-point throughput I could get. The code does not do much beyond assigning each thread a pair of nested loops:

#include <chrono>
#include <iostream>


int main() {
    auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
    for(int thread = 0; thread < 24; thread++) {
        float i = 0.0f;
        while(i < 100000.0f) {
            float j = 0.0f;
            while (j < 100000.0f) {
                j = j + 1.0f;
            }
            i = i + 1.0f;
        }
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time = end_time - start_time;
    std::cout << time / std::chrono::milliseconds(1) << std::endl;
    return 0;
}

To my surprise, perf reports that the throughput is very low:

$ perf stat -e all_dc_accesses -e fp_ret_sse_avx_ops.all cmake-build-release/roofline_prediction
8907

 Performance counter stats for 'cmake-build-release/roofline_prediction':

       325.372.690      all_dc_accesses                                             
   240.002.400.000      fp_ret_sse_avx_ops.all                                      

       8,909514307 seconds time elapsed

     202,819795000 seconds user
       0,059613000 seconds sys

With 240.002.400.000 FLOPs in 8.83 seconds, the machine achieved only 27.1 GFLOP/s, far below the CPU's capacity of 392 GFLOP/s (a figure I got from a roofline modelling tool).

My question is: how can I achieve higher throughput?

  • Compiler: GCC 9.3.0
  • CPU: AMD Threadripper 1920X
  • Optimization level: -O3
  • OpenMP flag: -fopenmp
curiouscupcake
  • How did you build the code? It takes 0 seconds to run for me. – EOF Oct 24 '21 at 16:59
  • Still takes 0 seconds, outputs `0` – EOF Oct 24 '21 at 17:04
  • Since neither `i` nor `j` is used after the loop, an optimizer can get rid of the entire thing, resulting in no execution time. Print those two values after you determine the end time of the loops. – 1201ProgramAlarm Oct 24 '21 at 17:08
  • @PepijnKramer I tested with both, same results though. – curiouscupcake Oct 24 '21 at 17:11
  • You are using only addition, not addition+multiplication, so the roofline must be 195 GFLOP/s, not 390 GFLOP/s. A GT1030 completes the same run in 217 seconds using 24 CUDA threads, which is only about 1 GFLOP/s. I don't think this is a good way of measuring GFLOP/s on any platform. Those loops also do not look vectorizable at all, so you're missing the whole width of the SIMD units and getting only 1/8 of peak performance. – huseyin tugrul buyukisik Jan 09 '22 at 15:38
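
To make 1201ProgramAlarm's suggestion concrete, a variant of the program that keeps the loop results observable could look like the following (the `sink` reduction is only an illustrative way to prevent the optimizer from deleting the loops; it has not been benchmarked):

#include <chrono>
#include <iostream>

int main() {
    auto start_time = std::chrono::high_resolution_clock::now();
    float sink = 0.0f;                       // observable result keeps the loops alive
#pragma omp parallel for reduction(+:sink)
    for(int thread = 0; thread < 24; thread++) {
        float acc = 0.0f;
        float i = 0.0f;
        while(i < 100000.0f) {
            float j = 0.0f;
            while (j < 100000.0f) {
                j = j + 1.0f;
            }
            acc = acc + j;                   // consume j so the inner loop has an observable effect
            i = i + 1.0f;
        }
        sink += acc;                         // combined across threads by the reduction
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    std::cout << sink << std::endl;          // printing the value prevents dead-code elimination
    std::cout << (end_time - start_time) / std::chrono::milliseconds(1) << std::endl;
    return 0;
}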

1 Answer


Compiled with GCC 9.3 with those options, the inner loop looks like this:

.L3:
        addss   xmm0, xmm2
        comiss  xmm1, xmm0
        ja      .L3

Some other combinations of GCC version / options may result in the loop being elided entirely; after all, it doesn't really do anything (except waste time).

The addss forms a loop-carried dependency chain with only itself in it, and that is not fast: on Zen 1 it takes 3 cycles per iteration, so the number of additions per cycle is 1/3. The maximum number of floating-point additions per cycle could be attained by having at least 6 independent addps instructions in flight (256-bit vaddps may help a bit, but Zen 1 executes such 256-bit SIMD instructions as 2 128-bit operations internally): with a latency of 3 cycles and a throughput of 2 per cycle, 6 operations need to be active at any time. That would correspond to 8 additions per cycle (2 addps per cycle, 4 lanes each), 24 times as much as the current code.

From a C++ program, it may be possible to coax the compiler into generating suitable machine code by:

  • Using -ffast-math (if possible, which it isn't always)
  • Using explicit vectorization with _mm_add_ps
  • Manually unrolling the loop, using (at least 6) independent accumulators (a sketch combining this with explicit vectorization follows below)
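
For illustration, here is a minimal sketch combining the last two points: explicit _mm_add_ps with 8 independent accumulators (6 would be the minimum derived above; 8 accumulators and the total of roughly 10^10 additions per thread are illustrative choices, not benchmarked values):

#include <immintrin.h>
#include <iostream>

int main() {
#pragma omp parallel for
    for(int thread = 0; thread < 24; thread++) {
        // 8 independent 128-bit accumulators: enough to cover the 3-cycle addps
        // latency at a throughput of 2 addps per cycle.
        __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
        __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
        __m128 acc4 = _mm_setzero_ps(), acc5 = _mm_setzero_ps();
        __m128 acc6 = _mm_setzero_ps(), acc7 = _mm_setzero_ps();
        const __m128 one = _mm_set1_ps(1.0f);

        // Roughly the same 10^10 scalar additions per thread as the original
        // nested loops, but issued as 8 * 4 = 32 independent additions per iteration.
        for(long long k = 0; k < 10000000000LL / 32; k++) {
            acc0 = _mm_add_ps(acc0, one);
            acc1 = _mm_add_ps(acc1, one);
            acc2 = _mm_add_ps(acc2, one);
            acc3 = _mm_add_ps(acc3, one);
            acc4 = _mm_add_ps(acc4, one);
            acc5 = _mm_add_ps(acc5, one);
            acc6 = _mm_add_ps(acc6, one);
            acc7 = _mm_add_ps(acc7, one);
        }

        // Combine and print the accumulators so the optimizer cannot discard the work.
        __m128 s = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
        s = _mm_add_ps(s, _mm_add_ps(_mm_add_ps(acc4, acc5), _mm_add_ps(acc6, acc7)));
        float out[4];
        _mm_storeu_ps(out, s);
#pragma omp critical
        std::cout << out[0] + out[1] + out[2] + out[3] << std::endl;
    }
    return 0;
}

(The printed values saturate around 2^24 because of float precision, but the additions are still executed, which is all that matters for a throughput measurement.)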
harold