
I have some matrix defined as:

Eigen::MatrixXd DPCint = Eigen::MatrixXd::Zero(p.szZ*(p.na-1),p.szX);

// perform some computations and fill every sub-matrix of size [p.szZ, p.szX] with some values
#pragma omp parallel for
for (int i=0; i < p.na-1; i++)
{
...
DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all) = ....;
}

// Now sum every p.szZ rows to get a matrix that is [p.szZ, p.szX]

In MATLAB, this operation is fast and trivial. I can't simply do a += operation here if I want to parallelize the loop with OpenMP. Similarly, I can loop through and sum every set of p.szZ rows, but that loop cannot be parallelized since each thread would be writing to the same data. Is there an efficient way to use Eigen's indexing operations to sum sub-matrices? This seems like a simple operation and I feel like I'm missing something, but I haven't been able to find a solution.

Clarification

Essentially, after the above loop, I'd like to do this in a single line:

for (int i = 0; i < p.na-1; i++)
{
DPC += DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all);
}

In MATLAB, I can simply reshape the matrix into a 3D array and sum along the third dimension. I'm not familiar with Eigen's tensor library, and I hope this operation is doable without resorting to it. However, my priority is speed and efficiency, so I'm open to any suggestions.
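For what it's worth, the closest equivalent I can see without the tensor library is Eigen 3.4's `reshaped()`. A rough single-threaded sketch of the reduction I'm after, using the names from above:

// View each column as a [p.szZ, p.na-1] matrix and sum across the second dimension (Eigen >= 3.4).
Eigen::MatrixXd DPC = Eigen::MatrixXd::Zero(p.szZ, p.szX);
for (int x = 0; x < p.szX; x++)
    DPC.col(x) += DPCint.col(x).reshaped(p.szZ, p.na - 1).rowwise().sum();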

drakon101
    "*since each thread would be outputting to the same data*" the loop access to different cells of `DPCint` (no overlapping) so it seems fine to me. – Jérôme Richard May 26 '22 at 19:50
  • 1
    Could you spell out a bit more what you want, either in Matlab or pseudo-code? – Homer512 May 26 '22 at 20:11
  • Okay, now as Jérôme pointed out, there doesn't seem to be an overlap between what the threads do. So you could just write ```DPC[i] += DPCint.middleRows(i * p.szZ, p.szZ).sum()```, right? – Homer512 May 26 '22 at 20:53
  • 2
    @Homer512 Since `i` is the iterator and it does not appear in the DPC matrix, I find it a bit weird to access to `DPC[i]`. I expect this matrix to be of size `p.szZ x p.szX`. I think the OP want to perform a matrix reduction. Thus, I do not think the expression is correct but something similar should work. – Jérôme Richard May 26 '22 at 22:00
  • 2
    In the last code there is certainly a problem using multiple threads since the DPC matrix cannot be read/written in parallel. You can perform a matrix reduction in (recent) OpenMP using `#pragma omp for readuction(+:DPC)` but this will copy the matrix for each thread which can be expensive. There are alternative solutions but we need the approximate size of the matrix to select which one is the best (when it comes to performance, every details matters). – Jérôme Richard May 26 '22 at 22:05
  • @JérômeRichard You're right. That makes sense. And yes, the approximate sizes for szZ, szX and na are definitely needed at this point – Homer512 May 26 '22 at 22:15
  • 2
    Do you need the `DPCint` matrix at any time, or could you directly accumulate into `DPC`? (Or one DPC matrix per thread which are then reduced to a single matrix ...) – chtz May 27 '22 at 09:41
  • @JérômeRichard is correct, I am looking to do a matrix reduction. Why do you need the approximate size of the matrix? A sample set of dimensions in my case could be ```[p.szZ,p.szX,na] = [500,192,15]```. – drakon101 May 27 '22 at 18:34
  • I do not need the DPCint matrix. I would directly accumulate into DPC but then I wouldn't be able to multithread the loop. If I follow @chtz suggestion, I guess my question could be reduced to "how can I set one DPC matrix per thread and automatically reduce that at the end of the loop?" – drakon101 May 27 '22 at 18:37
  • Does this answer your question? [OpenMP reduction with Eigen::VectorXd](https://stackoverflow.com/questions/40495250/openmp-reduction-with-eigenvectorxd) -- not exactly a duplicate but changes required for `MatrixXd` instead of `VectorXd` should not be too complicated. – chtz May 27 '22 at 19:52
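The per-thread-accumulator pattern mentioned in these comments can be expressed with a user-declared OpenMP reduction, along the lines of the linked question. A rough sketch (requires OpenMP 4.0+); the `computeSlice` helper is hypothetical and stands in for the elided `....` computation:

#include <Eigen/Dense>

// Hypothetical stand-in for the per-slice computation elided as "...." in the question.
Eigen::MatrixXd computeSlice(int i, int szZ, int szX)
{
    return Eigen::MatrixXd::Constant(szZ, szX, double(i));
}

// User-declared OpenMP reduction for Eigen::MatrixXd.
#pragma omp declare reduction(+: Eigen::MatrixXd: omp_out += omp_in) \
    initializer(omp_priv = Eigen::MatrixXd::Zero(omp_orig.rows(), omp_orig.cols()))

Eigen::MatrixXd accumulate(int szZ, int szX, int na)
{
    Eigen::MatrixXd DPC = Eigen::MatrixXd::Zero(szZ, szX);
    // Each thread accumulates into a private copy of DPC; OpenMP combines the copies at the end.
    #pragma omp parallel for reduction(+: DPC)
    for (int i = 0; i < na - 1; i++)
        DPC += computeSlice(i, szZ, szX);
    return DPC;
}

As noted in the comments, every thread then holds a full private copy of DPC, which can become expensive for large matrices.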

2 Answers


Performing a parallel reduction over the na-based axis is not efficient. That dimension is already too small for multiple threads to be useful, and it also (nearly) forces threads to operate on temporary matrices, which is inefficient (the operation is memory-bound, so it does not scale well).

An alternative solution is to parallelize along the szZ dimension. Each thread can work on a slice and perform a local reduction without temporary matrices. Moreover, this approach should also improve the use of CPU caches (the section of DPC computed by each thread is more likely to fit in cache, so it is not reloaded from RAM). Here is an (untested) example:

// All threads will execute the following loops (all iterations, but on different data blocks)
#pragma omp parallel
for (int i = 0; i < p.na-1; i++)
{
    // "nowait" avoid a synchronization but this require a 
    // static schedule which is a good idea to use here anyway.
    #pragma omp for schedule(static) nowait
    for (int j = 0; j < p.szZ; j++)
        DPC(j, Eigen::all) += DPCint(i*p.szZ+j, Eigen::all);
}

As pointed out by @chtz, it would be better to avoid the temporary DPCint matrix entirely, since memory throughput is a very limited resource (especially in parallel code).

EDIT: I assumed the matrices are stored in row-major order, which is not the case by default. This can be changed (see the doc), and it would in fact make both the first and the second loop cache-efficient. However, mixing storage orders is generally error-prone, and using row-major ordering forces you to redefine basic types. The solution of @Homer512 is an alternative implementation certainly better suited to column-major matrices.
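For reference, a row-major matrix type is a one-line alias; whether switching is worth it depends on the rest of the code:

// Eigen's MatrixXd is column-major by default; a row-major equivalent type:
using MatrixXdRowMajor = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;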

Jérôme Richard
  • Ohh, the outer loop is `parallel`, not `parallel for`. That's smart. I would put a comment on that to make it clear that this is intentional. – Homer512 May 28 '22 at 08:00

Here is my take.

#pragma omp parallel
{
     /*
      * We force static schedule to prevent excessive cache-line bouncing
      * because the elements per thread are not consecutive.
      * However, most (all?) OpenMP implementations use static scheduling
      * by default anyway.
      * Switching to threads initializing full columns would be
      * more effective from a memory POV.
      */
#    pragma omp for schedule(static)
     for(int i=0; i < p.na-1; i++) {
         /*
          * Note: Eigen::seq(first, last) is inclusive of last
          * (unlike most end indices in C++), so the original
          * bounds (i+1)*p.szZ-1 are correct and touch exactly
          * p.szZ rows per iteration.
          */
         DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all) = ...;
         /*
          * Same as
          * DPCint.middleRows(i*p.szZ, p.szZ) = ...
          */
     }
     /*
      * We rely on the implicit barrier at the end of the for-construct
      * for synchronization. Then start a new loop in the same parallel
      * construct. This one can be nowait as it is the last one.
      * Again, static scheduling limits cache-line bouncing to the first
      * and last column/cache line per thread.
      * But since we wrote rows per thread above and now read
      * columns per thread, there are still a lot of cache misses
      */
#    pragma omp for schedule(static) nowait
     for(int i=0; i < p.szX; i++) {
         /*
          * Now we let a single thread reduce a column.
          * Not a row because we deal with column-major matrices
          * so this pattern is more cache-efficient
          */
         DPC.col(i) += DPCint.col(i).reshaped(
               p.szZ, p.na - 1).rowwise().sum(); 
     }
}

Reshaping is new in Eigen-3.4. However, I noticed that the resulting assembly isn't particularly effective (no vectorization).

Rowwise reductions have always been somewhat slow in Eigen. So we might do better like this, which also works in Eigen-3.3:

#    pragma omp for schedule(static) nowait
     for(int i = 0; i < p.szX; i++) {
         const auto& incol = DPCint.col(i);
         auto outcol = DPC.col(i);
         // each column of DPCint holds p.na-1 stacked blocks of length p.szZ
         for(int j = 0; j < p.na - 1; j++)
             outcol += incol.segment(j * p.szZ, p.szZ);
     }

Alternatively, multiplying the reshaped matrix with a vector of ones also works surprisingly well. It needs benchmarking, but especially with Eigen backed by OpenBLAS it could be faster than a rowwise sum.
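For concreteness, that variant could look roughly like this (Eigen 3.4 for `reshaped()`; essentially what is benchmarked as rowwise_dot further down):

     // Sum the p.na-1 stacked blocks of column i via a product with a vector of ones.
     const Eigen::VectorXd ones = Eigen::VectorXd::Ones(p.na - 1);
#    pragma omp for schedule(static) nowait
     for(int i = 0; i < p.szX; i++)
         DPC.col(i).noalias() += DPCint.col(i).reshaped(p.szZ, p.na - 1) * ones;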

Benchmarking

Okay, I went ahead and did some tests. First, let's set up a minimal reproducible example, because we didn't have one before.

#include <Eigen/Dense>
#include <iostream>

void reference(Eigen::Ref<Eigen::MatrixXd> DPC,
               int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
#   pragma omp parallel for
    for(Eigen::Index a = 0; a < na; ++a)
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint(a * szZ + z, x) =
                      a * 0.25 + x * 1.34 + z * 12.68;
    for(Eigen::Index a = 0; a < na; ++a)
        DPC += DPCint.middleRows(a * szZ, szZ);
}
void test(Eigen::Ref<Eigen::MatrixXd> DPC,
          int na)
{...}
int main()
{
    const int szZ = 500, szX = 192, na = 15;
    const int repetitions = 10000;
    Eigen::MatrixXd ref = Eigen::MatrixXd::Zero(szZ, szX);
    Eigen::MatrixXd opt = Eigen::MatrixXd::Zero(szZ, szX);
    reference(ref, na);
    test(opt, na);
    std::cout << (ref - opt).cwiseAbs().sum() << std::endl;
    for(int i = 0; i < repetitions; ++i)
        test(opt, na);
}

The array dimensions are as described by the OP. The DPCint initialization was chosen to be scalar and to allow testing that any optimized implementation is still correct. The number of repetitions was picked for a reasonable runtime.
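The timings below are taken over the repetition loop; a minimal chrono-based harness for that (any external timer works just as well) could look like this:

#include <chrono>

// Times a callable over `repetitions` runs and returns seconds.
template <class F>
double timeIt(int repetitions, F&& f)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Usage inside main() above:
// std::cout << timeIt(repetitions, [&] { test(opt, na); }) << " s\n";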

Compiled and tested with g++-10 -O3 -march=native -DNDEBUG -fopenmp on an AMD Ryzen Threadripper 2990WX (32 cores, 64 threads), NUMA enabled, using Eigen 3.4.0.

The reference gives 16.6 seconds.

Let's optimize the initialization to get this out of the way:

void reference_op1(Eigen::Ref<Eigen::MatrixXd> DPC,
                   int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel for collapse(2)
    for(Eigen::Index a = 0; a < na; ++a)
        for(Eigen::Index x = 0; x < szX; ++x)
            DPCint.col(x).segment(a * szZ, szZ) = zvals.array() + xvals[x] + avals[a];
    for(Eigen::Index a = 0; a < na; ++a)
        DPC += DPCint.middleRows(a * szZ, szZ);
}

The linspaced vectors aren't really helping, but notice the collapse(2). Since na is only 15 on a 64-thread machine, we need to parallelize over two loops: 15.4 seconds.

Let's test my proposed version:

void rowwise(Eigen::Ref<Eigen::MatrixXd> DPC,
             int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
              DPC.col(x) += DPCint.col(x).reshaped(szZ, na).rowwise().sum();
    }
}

Runs at 12.5 seconds. Not a lot of speedup given that we just parallelized the second half of our algorithm.

As I suggested earlier, rowwise reductions are crap and can be avoided with matrix-vector products. Let's see if this helps here:

void rowwise_dot(Eigen::Ref<Eigen::MatrixXd> DPC,
                 int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
    const Eigen::VectorXd ones = Eigen::VectorXd::Ones(na);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            DPC.col(x).noalias() +=
                  DPCint.col(x).reshaped(szZ, na) * ones;
    }
}

Nope, still 12.5 seconds. What happens when we compile with -DEIGEN_USE_BLAS -lopenblas_openmp? The same number. It might still be worth it if you cannot compile for AVX2 but the CPU supports it, since Eigen has no runtime CPU feature detection. It might also help more with float than with double, because the benefit of vectorization is higher there.
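For clarity, the BLAS path only needs the macro defined before the Eigen headers (or passed on the command line as above) plus linking the library; roughly:

// Route supported Eigen products through an external BLAS.
// Define before any Eigen include (or pass -DEIGEN_USE_BLAS), then link e.g. -lopenblas_openmp.
#define EIGEN_USE_BLAS
#include <Eigen/Dense>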

What if we build our own rowwise reduction in a way that vectorizes?

void rowwise_loop(Eigen::Ref<Eigen::MatrixXd> DPC,
                  int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const Eigen::VectorXd zvals =
          Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for collapse(2)
        for(Eigen::Index a = 0; a < na; ++a)
            for(Eigen::Index x = 0; x < szX; ++x)
                DPCint.col(x).segment(a * szZ, szZ) =
                      zvals.array() + xvals[x] + avals[a];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index a = 0; a < na; ++a)
                DPC.col(x) += DPCint.col(x).segment(a * szZ, szZ);
    }
}

13.3 seconds. Note that on my laptop (Intel i7-8850H), this was significantly faster than the rowwise version. NUMA and cache-line bouncing may be a serious issue on the larger Threadripper, but I didn't investigate the perf counters.

Reordering DPCint

At this point I think it becomes apparent that the layout of DPCint and the loop ordering in its setup are a liability. Maybe there is a reason for it, but if there isn't, I propose changing it as follows:

void reordered(Eigen::Ref<Eigen::MatrixXd> DPC,
               int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const Eigen::VectorXd avals =
          Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
#       pragma omp for
        for(Eigen::Index x = 0; x < szX; ++x)
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint.col(x).segment(z * na, na) =
                      avals.array() + xvals[x] + zvals[z];

#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x)
            DPC.col(x) += DPCint.col(x).reshaped(na, szZ).colwise().sum().transpose();
    }
}

The idea is to reshape it in such a way that a) colwise sums are possible and b) the same thread touches the same elements in the first and second loop.

Interestingly, this seems slower at 15.3 seconds. I guess the innermost assignment is now too short.

What happens if we fold both parts of the algorithm into one loop, reducing the synchronization overhead and improving caching?

void reordered_folded(Eigen::Ref<Eigen::MatrixXd> DPC,
                        int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    Eigen::MatrixXd DPCint(szZ * na, szX);
    const Eigen::VectorXd avals =
          Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel for
    for(Eigen::Index x = 0; x < szX; ++x) {
        for(Eigen::Index z = 0; z < szZ; ++z)
            DPCint.col(x).segment(z * na, na) =
                  avals.array() + xvals[x] + zvals[z];
        DPC.col(x) += DPCint.col(x).reshaped(na, szZ).colwise().sum().transpose();
    }
}

12.3 seconds. At this point, why do we even have a shared DPCint array? Let's use a per-thread matrix.

void reordered_loctmp(Eigen::Ref<Eigen::MatrixXd> DPC,
                      int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    const Eigen::VectorXd avals =
        Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
    const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
    const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
#   pragma omp parallel
    {
        Eigen::MatrixXd DPCint(na, szZ);
#       pragma omp for nowait
        for(Eigen::Index x = 0; x < szX; ++x) {
            for(Eigen::Index z = 0; z < szZ; ++z)
                DPCint.col(z) = avals.array() + xvals[x] + zvals[z];
            DPC.col(x) += DPCint.colwise().sum().transpose();
        }
    }
}

Eureka! 6.8 seconds. We eliminated the cache-line bouncing and made everything cache-friendly and properly vectorized.

The only thing I can think of now is turning DPCint into an expression that is evaluated on the fly, but this very much depends on the actual expression, so I can only sketch the idea with the benchmark's placeholder expression below.
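A sketch of that idea, using this benchmark's placeholder fill expression (whether the real computation allows it is another matter):

void on_the_fly(Eigen::Ref<Eigen::MatrixXd> DPC, int na)
{
    const Eigen::Index szZ = DPC.rows();
    const Eigen::Index szX = DPC.cols();
    // Evaluate the (placeholder) fill expression on the fly and never materialize DPCint.
#   pragma omp parallel for
    for(Eigen::Index x = 0; x < szX; ++x)
        for(Eigen::Index z = 0; z < szZ; ++z) {
            double acc = 0.0;
            for(Eigen::Index a = 0; a < na; ++a)
                acc += a * 0.25 + x * 1.34 + z * 12.68;
            DPC(z, x) += acc;
        }
}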

Homer512
  • Good catch for the end of ranges. I also (wrongly) remembered that Eigen used a row-major storage order while it is indeed column-major (which is surprising for a C/C++ library). Because of that, it makes more sense to use `szX` as the parallelized dimension like you did (or alternatively use a manual row-major storage order so the first operation is a bit more efficient). – Jérôme Richard May 28 '22 at 23:05
  • Just a note regarding NUMA platforms: if the OP uses a NUMA system (e.g. typically AMD Zen processors or multi-socket processors) with the (default) first-touch policy, then it would be much better to parallelize the first loop along `szX` if possible (depending on the `....` part). Otherwise, this will cause (slower) remote page accesses, which matter because the second loop is memory-bound. – Jérôme Richard May 28 '22 at 23:21
  • @JérômeRichard Agreed. A lot depends on whether the first part could be restructured to make the second more effective. Regarding NUMA: I have access to some Zen 1 and Zen 2 machines which I configured to expose NUMA (not interleaving the memory channels). I might try to benchmark this if I find the time. – Homer512 May 29 '22 at 20:50
  • @Homer512 Is this the restructuring you had in mind: `DPCint -> [p.szZ,p.szX*p.na-1]` and the second loop uses this line: `DPC.col(i) = DPCint(Eigen::all,Eigen::seqN(i,p.na-1,p.szX))*Eigen::MatrixXd::Ones(p.na-1,1);` – drakon101 Jun 02 '22 at 21:11
  • @drakon101 I added some benchmarking and more versions. The performance characteristics are really fiddly with this one. What worked on my laptop didn't necessarily work on a Threadripper. – Homer512 Jun 04 '22 at 16:41