5

Would it be possible to implement a class that receives a C-style pointer as a template argument and somehow resolves into a static Eigen matrix but using the memory provided? Say a declaration would look something like:

EIGEN_ALIGN16 double array[9];
CMatrix<double,3,3,array> :: m;

I do know about maps, but the example code I provide below has proven them to be slower by 20% when compared to static Eigen matrices.

These would be the premises:

  • I need to provide my own C pointer. This way I can efficiently reuse C code without incurring copies.
  • The resulting matrix should look static to Eigen so that Eigen can optimize as it would with a static array at compile time. Look at my example above where at compile time I would provide both matrix size (static) and the C pointer.
  • CMatrix should fall back to Eigen::Matrix. When the additional template parameter for the C array is not provided I would get the normal Eigen matrix.
  • I do not intend to make a full Eigen extension. With that I mean is I do not care about all kind of checks to provide a neat extension for other users. I just want the most efficient solution for this problem

Would it be possible to implement a solution by adding a new constructor? Say something like:

EIGEN_ALIGN16 double data[9];
Eigen::Matrix<double,3,3> m(data); //where data is NOT copied but used to replace the static allocation used by default.

Find below my code for benchmarking Map vs. Matrix efficiency. It is self contained and you can compile with:

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I/path_to_my_Eigen benchmark1.cpp -o benchmark1 -lrt

Here is the code:

#include <Eigen/Eigen>
#include <bench/BenchTimer.h>

#include <iostream>

using namespace Eigen;
using namespace std;

//#define CLASSIC_METHOD
#define USE_MAPS

EIGEN_DONT_INLINE void classic(double VO[4], double AT[4][4], double VI[4])
{
  for (int ii=0; ii<4; ii++)
    {
      VO[ii] = AT[ii][0] * VI[0] +
               AT[ii][1] * VI[1] +
               AT[ii][2] * VI[2] +
               AT[ii][3] * VI[3];
    }
};

template <typename OutputType, typename MatrixType, typename VectorType>
EIGEN_DONT_INLINE void modern(MatrixBase<OutputType>& VOE, const MatrixBase<MatrixType>& A44, const MatrixBase<VectorType>& VIE)
{
  VOE.noalias() = A44.transpose()*VIE;
};

int main()
{
  EIGEN_ALIGN16 double AT[4][4] = {0.1, 0.2, 0.3, 2.0, 0.2, 0.3, 0.4, 3.0, 0.3, 0.4, 0.5, 4.0, 0.0, 0.0, 0.0, 1.0};
  EIGEN_ALIGN16 double VI[4] = {1, 2, 3, 4};
  EIGEN_ALIGN16 double VO[4];

//Eigen matrices

#ifndef USE_MAPS
  Matrix4d A44 = Matrix4d::MapAligned(AT[0]);
      Vector4d VIE = Vector4d::MapAligned(VI);
  Vector4d VOE(0,0,0,0);
#else
  Map<Matrix4d,Aligned> A44(AT[0]);
  Map<Vector4d,Aligned> VIE(VI);
  Map<Vector4d,Aligned> VOE(VO);

  // Map<Matrix4d> A44(AT[0]);                                                                                                                                                                                                                                      
  // Map<Vector4d> VIE(VI);                                                                                                                                                                                                                                           
  // Map<Vector4d> VOE(VO);

#endif

#ifdef EIGEN_VECTORIZE
  cout << "EIGEN_VECTORIZE defined" << endl;
#else
    cout << "EIGEN_VECTORIZE NOT defined" << endl;
#endif

  cout << "VIE:" << endl;
  cout << VIE << endl;

  VI[0] = 3.14;
  cout << "VIE:" << endl;
  cout << VIE << endl;

  BenchTimer timer;

  const int num_tries = 5;
  const int num_repetitions = 200000000;

#ifdef CLASSIC_METHOD
  BENCH(timer, num_tries, num_repetitions, classic(VO, AT, VI));
  std::cout << Vector4d::MapAligned(VO) << std::endl;
#else
  BENCH(timer, num_tries, num_repetitions, modern(VOE, A44, VIE));
  std::cout << VOE << std::endl;
#endif

  double elapsed = timer.best();
  std::cout << "elapsed time: " << elapsed*1000.0 << " ms" << std::endl;

  return 0;
}
Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Alejandro
  • 1,064
  • 1
  • 8
  • 22
  • Why would you think it is not possible? – danielschemmel May 05 '14 at 15:32
  • hint: start with `using std::size_t; template struct MyEfficientVector { using matrix_type = decltype(Var); matrix_type m = Var; } ;` and run from there :D – Massa May 05 '14 at 15:52
  • @Massa. Thank you. Would I need then to Inherit from Matrix/MatrixBase? how can I make this struct in the end to be an Eigen::Matrix? (I mean, I still want to have all the efficient operations, etc. that Eigen offers) – Alejandro May 05 '14 at 16:06
  • @gha.st. What would you suggest to implement this then? – Alejandro May 05 '14 at 16:06
  • @Alejandro that is a very good question. I don't know if it's even possible, given that `Eigen::MatrixBase` has its data in a `Eigen::DenseStorage`. Maybe you should start by giving a look at `Eigen/DenseStorage.h` and understand what you should do to replace its default storage schema for one of your design. In any case, your plan to do something "quick and dirty" is botched, because _if you want to extend Eigen's functionalities_, you'll certainly have to make a "proper" Eigen extension... – Massa May 05 '14 at 16:35
  • @Alejandro Go the easy way: Let the matrix object allocate the storage and ask for its [`data`](http://eigen.tuxfamily.org/dox/classEigen_1_1Matrix.html). While it is obviously possible to build your own matrix type by inheriting from `MatrixBase` it is a bit of work (although you can copy most of it from the default matrix type and would basically only need to replace `plain_array`). – danielschemmel May 05 '14 at 17:14
  • @gha.st. Thanks. Though a call to data() is not enough for me. Imagine a situation where I wan to have Matrix a and Matrix b but I'd like to both of them internally have the same array stored. – Alejandro May 05 '14 at 17:25
  • @Alejandro `Matrix a; Matrix& b = a;`? – danielschemmel May 05 '14 at 17:37
  • @gha.st. humm, I found online that this practice is usually discouraged. To me it does look like a nasty trick. Though I did considered this option I would try to avoid it. – Alejandro May 05 '14 at 18:37
  • @Alejandro Whoever advised you that marrying two objects to the same underlying store is better than using the same object twice does not rate very highly in my book. I might add that I do not see why you would wish to do this in the first place - is not one object with one name and one underlying store enough? – danielschemmel May 05 '14 at 18:54
  • @gha.st, You should read the post again. I am trying to update an existing C code to Eigen. I'd like to have the flexibility of still having my old C arrays and just ask Eigen to reuse them. Many times I would like to have several object representations of the same vector (for instance the column of a matrix might be viewed as a vector) and I'd like to still use the same memory in those cases to avoid unnecessary copies. I write numerical simulation codes where these operations might happen millions or even billions of times!!! – Alejandro May 05 '14 at 19:58

1 Answers1

5

Rather off-topic but since you stressed performance:

Eigen assembly isn't always optimal - there is some overhead due to poor register reuse and writing back to memory (this is not to blame Eigen by any means - doing this in generic templates is an impossible task).

If your kernels are fairly simple (QCD?), I would write assembly by hand (using intrinsics).

Here is your classical kernel rewritten in intrinsics, faster than Eigen version and same for Map/Matrix types (so you dont have to invent your own types).

EIGEN_DONT_INLINE void classic(double * __restrict__ VO, const double * __restrict__ AT, const double * __restrict__ VI) {
  __m128d vi01 = _mm_load_pd(VI+0);
  __m128d vi23 = _mm_load_pd(VI+2);
  for (int i = 0; i < 4; i += 2) {
    __m128d v00, v11;
    // v[i+0,i+0]                                                                                                                                                                                                   
    {
      int ii = i*4;
      __m128d at01 = _mm_load_pd(&AT[ii + 0]);
      __m128d at23 = _mm_load_pd(&AT[ii + 2]);
      v00 = _mm_mul_pd(at01, vi01);
      v00 = _mm_add_pd(v00, _mm_mul_pd(at23, vi23));
    }
    // v[i+1,i+1]                                                                                                                                                                                                   
    {
      int ii = i*4 + 4;
      __m128d at01 = _mm_load_pd(&AT[ii + 0]);
      __m128d at23 = _mm_load_pd(&AT[ii + 2]);
      v11 = _mm_mul_pd(at01, vi01);
      v11 = _mm_add_pd(v11, _mm_mul_pd(at23, vi23));
    }

    __m128d v = _mm_hadd_pd(v00, v11);
    // v is now [v00[0] + v00[1], v11[0] + v11[1]]                                                                                                                                                                               
    _mm_store_pd(VO+i, v);
    // VO[i] += AT[j+0 + i*4]*VI[j+0];                                                                                                                                                                              
    // VO[i] += AT[j+1 + i*4]*VI[j+1];                                                                                                                                                                              
  }
}

You may gain some additional improvement by interleaving loads and mul/adds - I tried to keep it simple.

These are the results:

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DCLASSIC_METHOD -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1 
elapsed time: 611.397 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DCLASSIC_METHOD -DUSE_MAPS -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 615.541 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DUSE_MAPS -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 981.941 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1 
elapsed time: 838.852 ms

On further note, you could possibly write a better simd kernel if you matrix was transposed - horizontal adds (_mm_hadd_pd) are expensive.

To add to discussion in comments: moving Eigen maps inside the function removes difference in time between map and matrix arguments.

EIGEN_DONT_INLINE void mapped(double (&VO)[4], const double (&AT)[4][4], const double (&VI)[4]) {
  Map<const Matrix4d,Aligned> A44(&AT[0][0]);
  Map<const Vector4d,Aligned> VIE(VI);
  Map<Vector4d,Aligned> VOE(VO);
  VOE.noalias() = A44.transpose()*VIE;
}

This is top of the assembly when passing Map to function (function not inlined)

    movq    (%rsi), %rcx
    movq    (%rdx), %rax
    movq    (%rdi), %rdx
    movapd  (%rcx), %xmm0
    movapd  16(%rcx), %xmm1
    mulpd   (%rax), %xmm0
    mulpd   16(%rax), %xmm1

compared to passing array reference (and map inside) or matrix

    movapd  (%rsi), %xmm0
    movapd  16(%rsi), %xmm1
    mulpd   (%rdx), %xmm0
    mulpd   16(%rdx), %xmm1
Anycorn
  • 50,217
  • 42
  • 167
  • 261
  • Wow!!, this is an amazing work!! thank you!! However writing assembly for each of my operations is way beyond what I could do with my finite time. Do you know if this is something Eigen does in the background? if not, do you know of any other library that does it? what do you think about this idea I proposed of passing a C array as an additional template argument? would you know if it is feasible? trying to cheat Eigen at compile time that is creating a static array but actually with memory passed by the user. – Alejandro May 05 '14 at 23:21
  • Yes, basically Eigen generates SSE instructions (provided data is aligned, etc). However it is not as efficient as it could be. you can inspect assembly with "-S" passed to compiler to generate assembly source (omit "-o"). You can certainly hack Eigen to do that but that may be a lot of work - the internals are rather complicated. The Map should be essentially the same as Matrix. I realize that runtimes are very different - let me see if I can figure out what happens. – Anycorn May 05 '14 at 23:34
  • My guess is the difference between Map and Matrix is caused by no-inline statement - Map requires few extra manipulations to extract addresses of the arrays from function arguments. Other than that, they do exactly same thing. Move map inside the function and pass the arrays to it - results will be same. I am adding code sample to demonstrate what i mean. – Anycorn May 06 '14 at 00:03
  • Wow, yes you are right!! I tried your code and I did see the difference. So it looks like passing a Map as a function argument is a bad idea unless the function gets inlined? I still dont understand what you mean with your statement "Map requires few extra manipulations to extract addresses of the arrays from function arguments". could you expand on this? thank you! – Alejandro May 06 '14 at 01:31
  • Look at the very last two listing in the answer. see where the first three lines are `movq`? those are instructions generated where map is not inlined to extract the data addresses. Note that the instructions are completely redundant - however for whatever reason compiler put them there. Map passing is not a bad idea but not inlining functions where you do mat-vec on 4x4 matrix is a bad idea. – Anycorn May 06 '14 at 01:37
  • I would rather patch compilers, this will be more useful. It's a shame that compilers are still not able to make a proper use of the numerous registers that are available. – ggael May 06 '14 at 07:35
  • Hoi, just a short comment concerning the assembly function: It relies on aligned data using _mm_load_pd and will potentially crash when data is not aligned, so take care when creating the variables used as function parameters and create them using for instance align16(...). – gilgamash May 14 '14 at 08:25