
I coded multiplication of a vector by a matrix two ways, one with OpenMP, the other with std::async. I expected the performance to be virtually identical. OpenMP is slow on the first call, probably because it defers making a thread pool. After that, the async version is consistently 40% slower. (I have an Intel Core i5 with 4 cores.)

What's the deal? Does VC++ not use a thread pool for async? Am I doing something silly? (Most likely.) I think access to the output vector is spaced out enough to avoid false sharing.

#include "stdafx.h"
#include <iostream>
#include <vector>
#include <omp.h>
#include <ctime>
#include <numeric>
#include <thread>
#include <chrono>
#include <future>

// Matrix multiplication of vector using omp
template<class Mat, class Vec>
void mult_mat_vec_omp
(const Mat &mat, const Vec &inp, Vec &out) {
    const int steps = static_cast<int>(std::size(mat));
    using std::begin; using std::end;
    auto N = std::thread::hardware_concurrency();
    omp_set_num_threads(N); 
#pragma omp parallel for 
    for (int i=0; i < steps; ++i) {
        out[i] = std::inner_product(begin(mat[i]), end(mat[i]), begin(inp), 0.0);
    }
}


// Matrix multiplication of vector using async
template<class Mat, class Vec>
void mult_mat_vec_async
(const Mat &mat, const Vec &inp, Vec &out) {
    using std::begin; using std::end;
    auto N = std::thread::hardware_concurrency();
    const unsigned steps = static_cast<unsigned>(std::size(mat));
    auto f = [&](unsigned id) {
    for (unsigned i=id; i < steps; i+= N) {
            out[i] = std::inner_product(begin(mat[i]), end(mat[i]), begin(inp), 0.0);
        }
    };
    std::vector<std::future<void>> threads;
    for (unsigned i = 1; i<N; ++i) {
        threads.push_back(std::async(std::launch::async, f, i));
    }
    f(0);
    for (auto &x: threads) {
        x.get();
    }
}


double test() {
    using std::vector;
    using clock=std::chrono::high_resolution_clock;
    vector<double> a;
    vector<double> b;
    vector<double> c;
    vector<vector<double>> mat;
    vector<double> v;
    int rows = 350;
    int cols = 350;
    for (int i = 0; i< cols; ++i) {
        a.push_back(i/10.0);
        b.push_back(-999);
        c.push_back(8888);
    }
    for (int i=0; i<rows; ++i) {

        v.clear();
        for (int j=0; j<cols; ++j) {
            v.push_back (((i+.5)*j)/100.0);
        }
        mat.push_back(v);
    }
    clock::time_point start = clock::now();
    int N = 10000;
    for (int i = 0; i < N/10; ++i) {
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
        mult_mat_vec_omp(mat, a, b);
    }
    long long duration = std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - start).count();
    start = clock::now();
    for (int i = 0; i < N/10; ++i) {
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
        mult_mat_vec_async(mat, a, c);
    }
    long long duration2 = std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - start).count();
    //cout << mat[0][5] << " " << b[0] << " " << c[0] << endl;
    bool good = (b==c);
    std::cout << duration*(1.0/N) << ' ' << duration2*(1.0/N) << " " << good << std::endl;
    return 0;
}

int main ()
{
    for(int i=0; i<15; ++i) test();
    return 0;
}
Jive Dadson
  • How long does the benchmark last? Seconds? Milliseconds? – Mysticial Oct 24 '17 at 15:46
  • Seconds per iteration (Release mode). It should be obvious how to reduce that if you have to catch a bus. Cut back on the N=10000. – Jive Dadson Oct 24 '17 at 15:49
  • For one, the memory access pattern in the `std::async` version is pretty terrible. Each thread is striding by `N`. In the OpenMP version, you let OpenMP handle the work distribution. And it's probably using a block distribution instead of stride distribution. – Mysticial Oct 24 '17 at 15:51
  • Terrible because of cache misses? N is only 4. – Jive Dadson Oct 24 '17 at 15:52
  • I'm not sure if it makes a difference here, but it means that each thread is touching 4x more cache lines than it needs to. Another possibility is that you're getting false sharing on the writes to `out[i]`. They're interleaved between the threads, so there's a real possibility of a conflict. – Mysticial Oct 24 '17 at 15:56
  • I think you may be right about touching more lines. I doubt it's false sharing. I think I remember that the cache lines are short. I'll work on it. BRB – Jive Dadson Oct 24 '17 at 16:01
  • I tried using block distribution. No joy. The stackoverflow software is nagging about an extended discussion. Later... – Jive Dadson Oct 24 '17 at 16:11
  • It seems like OpenMP doesn't do anything, and the OpenMP pragmas have no effect. It runs synchronously, and this is why the thread version is faster. – David Haim Oct 24 '17 at 16:31
  • Could you post the specific processor model and performance numbers. That makes it easier to reproduce and give a specific answer. – Zulan Oct 24 '17 at 17:31
  • @Zulan - Intel Core i5-2400 @ 3.10 GHz – Jive Dadson Oct 24 '17 at 17:35

1 Answer


On an Intel Core i7-2600 with HT disabled, gcc 7.2 / Linux, the numbers are somewhat different: the async version is about 10% slower.

Now the discussion is on the right track regarding cache efficiency and false sharing. Each thread should access consecutive elements, at least up to the size of a cache line (e.g. 64 bytes). For reads you simply save memory accesses by using cache / data locality more efficiently; for writes it is even worse, because false sharing bounces the cache line back and forth between the cores. With 8-byte doubles and 64-byte lines, eight adjacent elements of out share one line, so in the strided scheme all four threads end up writing to the same line. However, it is important to recognize that this is not about the bulk of the data access - that happens within std::inner_product and is identical in both versions. If the bulk data access were done in such a thread-interleaved pattern, the performance would be much worse than 40% off.

Now it's easy to avoid this and test whether it helps:

const unsigned steps = static_cast<unsigned>(std::size(mat));
auto f = [&](unsigned id) {
    // Give each thread one contiguous block of rows instead of a stride of N.
    const auto chunk_size = 1 + ((steps - 1) / N);   // ceil(steps / N)
    const auto max = std::min(chunk_size * (id + 1), steps);
    for (unsigned i = chunk_size * id; i < max; i++)
    {
        out[i] = std::inner_product(begin(mat[i]), end(mat[i]), begin(inp), 0.0);
    }
};
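
Each thread now writes one contiguous block of out, so at most one cache line at each block boundary is shared between threads. This is essentially the block distribution that an OpenMP parallel for typically uses by default (a static schedule), as already suggested in the comments.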

In my configuration that eliminates all performance differences between the versions.

If you still see a difference in performance on your system, I would recommend using a suitable performance analysis tool. I'm not familiar with your ecosystem, so I can't make any specific recommendations - but it's important not to guess about performance.

Note that a std::vector<std::vector<double>> is not a good data structure for high-performance data access / matrix multiplication. You won't get close to the performance of a highly optimized library that uses contiguous memory for the matrix.
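
For illustration only - the FlatMatrix type and mult_flat function below are made up for this sketch, not part of the question - here is a minimal flat, row-major layout that keeps the whole matrix in one contiguous allocation, so each inner product streams through adjacent memory:

#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical flat, row-major matrix: row i occupies data[i*cols] .. data[i*cols + cols - 1].
struct FlatMatrix {
    std::size_t rows, cols;
    std::vector<double> data;                       // rows * cols doubles, contiguous
    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
    const double* row(std::size_t i) const { return data.data() + i * cols; }
    double*       row(std::size_t i)       { return data.data() + i * cols; }
};

// Same inner-product kernel as above, but the matrix is a single contiguous block,
// which the hardware prefetcher can stream through row after row.
void mult_flat(const FlatMatrix &mat, const std::vector<double> &inp, std::vector<double> &out) {
    for (std::size_t i = 0; i < mat.rows; ++i) {
        out[i] = std::inner_product(mat.row(i), mat.row(i) + mat.cols, inp.begin(), 0.0);
    }
}

Optimized libraries (BLAS implementations, Eigen, and so on) use this kind of contiguous storage and add blocking and vectorization on top of it.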

Zulan
  • Thanks a lot. On my machine, changing from striding to chunking that way made a 10% improvement. The async version is still 30% slower than omp. Very strange. What manner of magic is the omp version doing under the hood? – Jive Dadson Oct 25 '17 at 06:19
  • I changed "mat" in the test routine from vector<vector<double>> to double mat[rows][cols]. I was chuffed that the templates worked without change, but I saw no speed improvement. – Jive Dadson Oct 25 '17 at 06:36