For a small benchmark of OpenMP on an i7-6700K I wrote the following code:

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>

constexpr int bench_rounds = 32;

int main(void) {
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::milliseconds;

    
    size_t vec_size = 16;
    size_t large_vec_size = 1024*16;
    std::vector<double> test_vec(large_vec_size * large_vec_size, 0);

    auto t1 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for collapse(2)
    for(int j = 0; j < large_vec_size; ++j) {
        for(int i = 0; i < large_vec_size; ++i) {
            test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t2 = high_resolution_clock::now();
    duration<double, std::milli> ms_double = t2 - t1;
    std::cout << ms_double.count() << "ms\n";

    return 0;
}

I compiled it using four different approaches:

  1. With the latest Intel compiler, using icpc main.cpp -o test
  2. With the latest Intel compiler, using icpc -qopenmp main.cpp -o test -liomp5
  3. With GCC 11.2, using g++ main.cpp -o test
  4. With GCC 11.2, using g++ -fopenmp main.cpp -o test -lgomp

The results I obtained were:

  1. Warning "unrecognized OpenMP #pragma #pragma omp parallel for collapse(2)", runtime: 2490 ms
  2. No warning, runtime: 14080 ms
  3. No warning, runtime: 45550 ms
  4. No warning, runtime: 13400 ms

The result for GCC is more or less as expected: I am running it on four cores, and my speed-up is slightly larger than three. But for the Intel compiler I do not understand the result: why is it so much faster, especially when it disregards the OpenMP pragma?

To extend the benchmark data following requests in the comments, my compile lines are:

g++ main.cpp -o test_gcc_clean
g++ -fopenmp main.cpp -o test_gcc_omp -lgomp
g++ -fopenmp -march=native -mavx -O3 main.cpp -o test_gcc_opt -lgomp
icpc main.cpp -o test_icc_clean
icpc -qopenmp main.cpp -o test_icc_omp -liomp5
icpc -qopenmp -march=native -mavx -O3 main.cpp -o test_icc_opt -liomp5

and my run file is:

echo "Clean GCC"
./test_gcc_clean
echo "GCC with OpenMP"
./test_gcc_omp
echo "Optimized GCC"
./test_gcc_opt
echo "Clean ICC"
./test_icc_clean
echo "ICC with OpenMP"
./test_icc_omp
echo "Optimized ICC"
./test_icc_opt

The output then is:

Clean GCC
45641.6ms
GCC with OpenMP
13358.6ms
Optimized GCC
4949.53ms
Clean ICC
2471.96ms
ICC with OpenMP
14014.6ms
Optimized ICC
13662.9ms

These numbers reflect my earlier results. Interestingly, a program compiled with

icpc -march=native -mavx -O3 main.cpp -o test_icc_nomp

is even faster, with a runtime of 1286 ms, but warns during compilation that it does not recognize the OpenMP pragma.

Edit:
Following the suggestions in the answer, I extended the benchmark for the sake of completeness. It now tests the assumption that comparing size_t against int indices slows down the code, and adds a final verification of the test vectors so the compiler cannot optimize the work away. I used the following code:

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>

constexpr int bench_rounds = 32;

int main(void) {
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::milliseconds;

    
    size_t vec_size = 16;
    size_t large_vec_size = 1024*16;
    std::vector<double> test_vec(large_vec_size * large_vec_size, 0),
    test_vec_II(large_vec_size * large_vec_size, 0),
    test_vec_III(large_vec_size * large_vec_size, 0),
    test_vec_IV(large_vec_size * large_vec_size, 0);

    auto t1 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for collapse(2)
    for(int j = 0; j < large_vec_size; ++j) {
        for(int i = 0; i < large_vec_size; ++i) {
            test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t2 = high_resolution_clock::now();

    auto t3 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for
    for(int j = 0; j < large_vec_size; ++j) {
        #pragma omp simd
        for(int i = 0; i < large_vec_size; ++i) {
            test_vec_II[j * large_vec_size + i] = i + j + test_vec_II[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t4 = high_resolution_clock::now();

    auto t5 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for collapse(2)
    for(size_t j = 0; j < large_vec_size; ++j) {
        for(size_t i = 0; i < large_vec_size; ++i) {
            test_vec_III[j * large_vec_size + i] = i + j + test_vec_III[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t6 = high_resolution_clock::now();

    auto t7 = high_resolution_clock::now(); 
    for(int k = 0; k < bench_rounds; ++k) {
    #pragma omp parallel for
    for(size_t j = 0; j < large_vec_size; ++j) {
        #pragma omp simd
        for(size_t i = 0; i < large_vec_size; ++i) {
            test_vec_IV[j * large_vec_size + i] = i + j + test_vec_IV[j * large_vec_size + i] * 1e-13;
        }
    }
    }
    auto t8 = high_resolution_clock::now();


    duration<double, std::milli> ms_double = t2 - t1, 
    ms_double_simd = t4 - t3, 
    ms_double_sizet = t6 - t5, 
    ms_double_simd_sizet = t8 - t7;
    std::cout << "Coll: " << ms_double.count() << " ms\n";
    std::cout << "SIMD: " << ms_double_simd.count() << " ms\n";
    std::cout << "CoST: " << ms_double_sizet.count() << " ms\n";
    std::cout << "SIST: " << ms_double_simd_sizet.count() << " ms\n";

    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_II.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }
    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_III.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }
    std::cout << "Vectors are equal: ";
    if(std::equal(test_vec.begin(), test_vec.begin() + large_vec_size * large_vec_size, test_vec_IV.begin())) {
        std::cout << "True\n";
    } else {
        std::cout << "False\n";
    }

    return 0;
}

and obtained the following results:

Clean GCC
Coll: 46281.8 ms
SIMD: 47917.9 ms
CoST: 44322 ms
SIST: 44275.4 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
GCC with OpenMP
Coll: 13799.6 ms
SIMD: 14546 ms
CoST: 12913.8 ms
SIST: 13113.1 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized GCC
Coll: 4955.54 ms
SIMD: 5080.45 ms
CoST: 5203.64 ms
SIST: 5011.17 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized GCC, no OpenMP
Coll: 5201.49 ms
SIMD: 5198.48 ms
CoST: 6148.23 ms
SIST: 6279.94 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Clean ICC
Coll: 2579.12 ms
SIMD: 5315.75 ms
CoST: 5296.52 ms
SIST: 6892.02 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
ICC with OpenMP
Coll: 14466 ms
SIMD: 4974.81 ms
CoST: 13539.5 ms
SIST: 4963.63 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized ICC
Coll: 15753.4 ms
SIMD: 5114.96 ms
CoST: 13509.4 ms
SIST: 5100.88 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True
Optimized ICC, no OpenMP
Coll: 1302.34 ms
SIMD: 5200.3 ms
CoST: 5535.02 ms
SIST: 5565.15 ms
Vectors are equal: True
Vectors are equal: True
Vectors are equal: True

At least on my platform I get some interesting results:

  • A loop mixing int and size_t can be optimized better by ICC (and in some cases by GCC) than a loop using only size_t
  • #pragma omp simd is faster than #pragma omp parallel for collapse(2), but the version without OpenMP is still significantly faster, especially for ICC
arc_lupus
  • What do you get if you use maximum optimization (e.g. -O3 -arch:native flags)? – Laci Sep 16 '21 at 08:14
  • @Laci: I added those numbers. I get a clear speed-up when using GCC (by a factor of 9 compared to the unoptimized version), but a slow-down in the case of the Intel compiler – arc_lupus Sep 16 '21 at 08:40
  • Your code is not computationally expensive, it is memory bound, because you perform very few operations on a very large array. It seems that vectorization significantly increases the speed, but the vectorized code will not be any faster if OpenMP is used on your hardware. My guess is that g++ without OpenMP but with the -O3 -mavx2 flags will also be fast. (Note that on your processor you should use -mavx2.) – Laci Sep 16 '21 at 09:22
  • In the case of memory-bound code, speed depends on the memory controller, the number of memory channels, cache utilization/prefetching, etc. It is not surprising that the Intel compiler can do this efficiently on an Intel processor. OpenMP has a significant overhead, and for memory-bound code it is not worth using (at least on your hardware when vectorization is used). Note that removing `collapse(2)` can increase the speed of the OpenMP version. – Laci Sep 16 '21 at 09:43

1 Answer

The problem comes from the collapse(2) clause and is also related to auto-vectorization of the code. Indeed, neither compiler is able to auto-vectorize the loop when the collapse is used; on top of that, ICC emits a very expensive idiv instruction in the middle of the hot loop (which is very bad), while GCC produces better code. This comes from the collapse(2) clause, which is not well optimized (by many compilers).

You can see that on Godbolt. Note that optimizing a kernel using a collapse(2) clause is not easy, since the compiler does not know the loop bounds and therefore the associated cost (nor the divisor needed for the modulus).
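
To make the cost concrete, here is roughly what a collapse(2) loop nest can be lowered to when the trip counts are only known at run time (a hand-written sketch, not actual compiler output; it reuses test_vec and large_vec_size from the question):

// Hand-written illustration: the 2D iteration space is flattened into one
// index, and every iteration recovers (j, i) with an integer division and a
// modulus -- which is where the expensive idiv in the hot loop comes from.
const long long total = (long long)large_vec_size * large_vec_size;
#pragma omp parallel for
for(long long idx = 0; idx < total; ++idx) {
    const long long j = idx / (long long)large_vec_size; // integer division
    const long long i = idx % (long long)large_vec_size; // modulus
    test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
}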

Without the collapse(2), GCC is able to vectorize the loop successfully, but surprisingly ICC is not. Fortunately, we can help ICC with the simd directive. Once it is used, both compilers generate relatively good code. It is still not optimal, because size_t is generally 8 bytes while int is 4 bytes on mainstream x86-64 platforms, and comparing loop counters of different types makes the code harder to vectorize efficiently and prevents the best scalar instructions from being generated. You can use a temporary variable to fix that. You can see the resulting assembly code here.

Note that the assembly generated by ICC is very good once the code is fixed. The code is memory bound, and the final version should saturate the RAM with only a few threads. Even the L3 cache should be saturated by the ICC-generated assembly if the input array fitted into it.
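
For a rough sense of scale, here is a back-of-the-envelope estimate of the memory traffic (the ~34 GB/s figure is the theoretical peak of dual-channel DDR4-2133 on an i7-6700K and is an assumption about the test machine; sustained bandwidth will be somewhat lower):

#include <iostream>

// Back-of-the-envelope estimate: each of the 32 rounds reads and writes every
// element of a 16384 x 16384 array of doubles exactly once.
int main() {
    const double bytes_per_round = 16384.0 * 16384.0 * 8.0 * 2.0; // one read + one write per double, ~4.3 GB
    const double total_traffic   = 32.0 * bytes_per_round;        // ~137 GB over all rounds
    const double peak_bandwidth  = 34.1e9;                        // bytes/s, dual-channel DDR4-2133 (assumed)
    std::cout << total_traffic / peak_bandwidth << " s\n";        // ~4 s bandwidth-limited lower bound
    return 0;
}

That lower bound of roughly four seconds is close to the ~5 s the well-vectorized runs converge to, which is why adding more threads does not buy much here; a timing clearly below it would suggest the compiler skipped part of the work (see the note on dead-code elimination below).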

Here is the fixed code:

for(int k = 0; k < bench_rounds; ++k) {
    int limit = large_vec_size;
    #pragma omp parallel for
    for(int j = 0; j < limit; ++j) {
        #pragma omp simd
        for(int i = 0; i < limit; ++i) {
           test_vec[j * large_vec_size + i] = i + j + test_vec[j * large_vec_size + i] * 1e-13;
        }
    }
}

Moreover, note that the result is not used, so optimizing compilers can simply remove the benchmark code because it does not have any observable side effects (dead-code elimination). Generally, you can fix that by just printing the result, which is also useful for checking correctness.
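
A minimal sketch of such a sink, assuming the test_vec and the iostream setup of the question's program:

// Summing and printing the result gives the benchmark an observable side
// effect, so the compiler cannot drop the timed loop as dead code.
double checksum = 0.0;
for(double v : test_vec)
    checksum += v;
std::cout << "checksum: " << checksum << "\n";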

Jérôme Richard
  • Thanks for the extensive answer! I extended my code to test your suggestions, and interestingly the mixture of `size_t` and `int` is still faster than pure `size_t`, which might be a result of the longer counter variable type. Keeping the loop variables purely as `int` might improve the speed even more, which will be the next step to test. – arc_lupus Sep 16 '21 at 19:21
  • @arc_lupus Yes, I was surprised too when I tested using `size_t`. I expected the code to be faster since `sizeof(size_t)` is equal to `sizeof(double)` on most modern platforms (e.g. 64-bit ones). The reason it is faster with `int` is that there is no SIMD x86 SSE/AVX/AVX2 instruction to convert a 64-bit integer to a double... This is a mess and causes the compiler not to vectorize the code (at least not efficiently). Note that AVX-512 adds this important missing instruction. Sadly, AVX-512 is only supported on very recent x86-64 processors. – Jérôme Richard Sep 16 '21 at 19:44
  • At least based on the latest benchmark runs, the best combination is using ICC as the compiler, without OpenMP directives and ideally with int as the only index type. Still, a mixed size_t/int approach is basically as fast as the pure int approach. When using only size_t, my runtime increases by a factor of four. Adding OpenMP to the mix results in a runtime increase by a factor between two and five. – arc_lupus Sep 17 '21 at 06:30