
I am receiving arrays of Eigen::MatrixXf and Eigen::Matrix4f in real time. Both arrays have the same number of elements. All I am trying to do is multiply the elements of the two arrays together and store the result in another array at the same index.

Please see the code snippet below:

#define COUNT 4

while (all_ok())
{
    Eigen::Matrix4f    trans[COUNT];
    Eigen::MatrixXf  in_data[COUNT];
    Eigen::MatrixXf out_data[COUNT];

    // at each iteration, new data is filled
    // in 'trans' and 'in_data' variables

    #pragma omp parallel num_threads(COUNT)
    {
        #pragma omp for
        for (int i = 0; i < COUNT; i++)
            out_data[i] = trans[i] * in_data[i];
    }
}

Please note that COUNT is a constant. The sizes of trans and in_data are (4 x 4) and (4 x n) respectively, where n is approximately 500,000. To parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of the for loop.

Any suggestions? Are there alternative ways to perform the same operation?

Edit: My idea is to define 4 (= COUNT) threads, each taking care of one multiplication. That way, we don't need to create the threads on every iteration, I guess!

ravi
  • I doubt that four matrix multiplications are so slow that the overhead of spinning up threads for them is worthwhile. If you had ten thousand items in your arrays, maybe the picture becomes clearer (keep it at four threads or so, though; ten thousand threads will slow everything down). – hlt May 13 '18 at 09:26
  • Thanks! Maybe you are right. Well, the data is received in real time and needs to be multiplied every time. Is it possible to define 4 threads, each taking care of one multiplication? That way, we don't need to create threads every time, I guess! – ravi May 13 '18 at 09:30
  • As far as I know, the worker threads are always started at the very start of the program (https://stackoverflow.com/questions/8132565/how-do-i-ask-openmp-to-create-threads-only-once-at-each-run-of-the-program#8133376), so that should be the default behavior. I'm not particularly experienced with OpenMP, though, so I might be wrong here. – hlt May 13 '18 at 09:34
  • You are describing a thread pool. – Jesper Juhl May 13 '18 at 09:34
  • @JesperJuhl and hlt: Thank you very much. I just added more information. Please have a look! Thanks again. – ravi May 13 '18 at 09:42
  • Unrelated suggestions: You should write `out_data[i].noalias() = trans[i] * in_data[i];` to avoid unnecessary copying. And you can use the `Matrix4Xf` type for `in_data` and `out_data` (the latter should barely make a difference, though). – chtz May 13 '18 at 11:52
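
To make chtz's comment concrete, here is a minimal sketch of the loop with .noalias() and the fixed-row-count Matrix4Xf type (assuming the rest of the setup from the question):

#include <Eigen/Dense>

// Matrix4Xf fixes the row count (4) at compile time, and noalias()
// promises Eigen that the destination does not overlap the operands,
// so it can skip the temporary it would otherwise create.
Eigen::Matrix4f  trans[COUNT];
Eigen::Matrix4Xf in_data[COUNT];
Eigen::Matrix4Xf out_data[COUNT];

#pragma omp parallel for num_threads(COUNT)
for (int i = 0; i < COUNT; i++)
    out_data[i].noalias() = trans[i] * in_data[i];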

2 Answers


You need to specify -fopenmp during compilation and linking. But you will quickly hit the limit where RAM bandwidth stops any further speed-up. You really should have a look at vector intrinsics. Depending on your CPU, you could accelerate your operations by a factor of your register width divided by the size of your variable (float = 4 bytes). So if your processor supports, say, AVX, you'd be dealing with 8 floats at a time. If you need some inspiration, you're welcome to steal code from my medical image reconstruction library here: https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp. The code does the whole shebang for float/double, real and complex.
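
For illustration, a minimal AVX/FMA sketch of the 4 x 4 times 4 x n product itself (my own example, not taken from the library above; it assumes column-major storage, 32-byte-aligned buffers, an even n, and a CPU with AVX and FMA; the name mul4xN_avx is made up):

#include <immintrin.h>

// Y = T * X for a column-major 4x4 matrix T and a column-major 4xN
// matrix X, processing two columns (8 floats) per 256-bit register.
// Build with -mavx -mfma (or -march=native).
void mul4xN_avx(const float* T, const float* X, float* Y, int N)
{
    // Duplicate each column of T into both 128-bit lanes so that one
    // FMA updates two output columns at once.
    const __m256 t0 = _mm256_broadcast_ps((const __m128*)(T + 0));
    const __m256 t1 = _mm256_broadcast_ps((const __m128*)(T + 4));
    const __m256 t2 = _mm256_broadcast_ps((const __m128*)(T + 8));
    const __m256 t3 = _mm256_broadcast_ps((const __m128*)(T + 12));

    for (int j = 0; j < N; j += 2)          // N assumed even
    {
        __m256 x = _mm256_load_ps(X + 4*j); // columns j and j+1 of X
        // y = T(:,0)*x0 + T(:,1)*x1 + T(:,2)*x2 + T(:,3)*x3
        __m256 y = _mm256_mul_ps(t0, _mm256_permute_ps(x, 0x00));
        y = _mm256_fmadd_ps(t1, _mm256_permute_ps(x, 0x55), y);
        y = _mm256_fmadd_ps(t2, _mm256_permute_ps(x, 0xAA), y);
        y = _mm256_fmadd_ps(t3, _mm256_permute_ps(x, 0xFF), y);
        _mm256_store_ps(Y + 4*j, y);
    }
}

Note that Eigen's own kernels already generate comparable code when AVX is enabled at compile time, so hand-written intrinsics mainly pay off once you fuse several operations into a single pass over the data.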

Kaveh Vahedipour

This works for me with the following self-contained example; that is, I get a 4x speed-up when enabling OpenMP:

#include <iostream>
#include <Eigen/Dense>
#include <bench/BenchTimer.h> // lives in the "bench/" folder of the Eigen source tree
using namespace Eigen;

const int COUNT = 4;

EIGEN_DONT_INLINE
void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
{
  #pragma omp parallel for num_threads(COUNT)
  for (int i = 0; i < COUNT; i++)
    out_data[i] = trans[i] * in_data[i];
}

int main()
{
  Eigen::Matrix4f    trans[COUNT];
  Eigen::MatrixXf  in_data[COUNT];
  Eigen::MatrixXf out_data[COUNT];
  int n = 500000;
  for (int i = 0; i < COUNT; i++)
  {
    trans[i].setRandom();
    in_data[i].setRandom(4,n);
    out_data[i].setRandom(4,n);
  }

  int tries = 3;
  int rep = 1;

  BenchTimer t;

  BENCH(t, tries, rep, foo(trans, in_data, out_data));

  std::cout << " " << t.best(Eigen::REAL_TIMER) << " (" << double(n)*4.*4.*4.*2.e-9/t.best() << " GFlops)\n";

  return 0;
}

So 1) make sure you measure wall-clock time and not CPU time, and 2) make sure that the products are the bottleneck and not filling in_data.
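
For point 1, a minimal sketch of a wall-clock measurement (using the OpenMP timer; std::chrono::steady_clock would work just as well):

#include <omp.h>

// omp_get_wtime() returns elapsed wall-clock seconds; std::clock()
// instead accumulates CPU time across all threads, which would hide
// the parallel speed-up.
double t0 = omp_get_wtime();
foo(trans, in_data, out_data);
double elapsed = omp_get_wtime() - t0;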

Finally, for maximal performance don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with compiler optimizations turned on.
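
For instance, with GCC the example above might be built like this (a sketch; the path to the Eigen source tree, which provides both Eigen/Dense and bench/BenchTimer.h, is a placeholder):

g++ -O3 -march=native -fopenmp -I /path/to/eigen-source example.cpp -o bench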

For the record, on my computer the above example takes 0.25s without OpenMP and 0.065s with it.

ggael