
Edit: The huge difference in performance was due to a bug in my test; when set up properly, Eigen is 2 to 3 times faster than scipy.sparse.

I noticed that sparse matrix multiplication using the C++ Eigen library is much slower than using the Python scipy.sparse library. What scipy.sparse does in ~0.03 seconds takes ~25 seconds with Eigen. Am I doing something wrong in Eigen?

Here is the Python code:

from scipy import sparse
from time import time
import random as rn

N_VALUES = 200000
N_ROWS = 400000
N_COLS = 400000

rows_a = rn.sample(range(N_COLS), N_VALUES)
cols_a = rn.sample(range(N_ROWS), N_VALUES)
values_a = [rn.uniform(0,1) for _ in xrange(N_VALUES)]

rows_b = rn.sample(range(N_COLS), N_VALUES)
cols_b = rn.sample(range(N_ROWS), N_VALUES)
values_b = [rn.uniform(0,1) for _ in xrange(N_VALUES)]

big_a = sparse.coo_matrix((values_a, (cols_a, rows_a)), shape=(N_ROWS, N_COLS))
big_b = sparse.coo_matrix((values_b, (cols_b, rows_b)), shape=(N_ROWS, N_COLS))

big_a = big_a.tocsr()
big_b = big_b.tocsr()

start = time()

AB = big_a * big_b

end = time()

print 'time taken : {}'.format(end - start)

And here is the C++ code:

#include <iostream>
#include <cstdlib>
#include <ctime>      // clock(), CLOCKS_PER_SEC
#include <random>     // std::default_random_engine, std::uniform_real_distribution
#include <vector>
#include <algorithm>
#include <Eigen/Dense>
#include <Eigen/Sparse>

using namespace Eigen;

std::vector<long> gen_random_sample(long min, long max, long sample_size);
double get_random_double(double min, double max);
std::vector<double> get_vector_of_rn_doubles(int length, double min, double max);

int main()
{

  long N_COLS = 400000;
  long N_ROWS = 400000;
  long N_VALUES = 200000;

  SparseMatrix<double> big_A(N_ROWS, N_COLS);
  std::vector<long> cols_a = gen_random_sample(0, N_COLS, N_VALUES);
  std::vector<long> rows_a = gen_random_sample(0, N_COLS, N_VALUES);
  std::vector<double> values_a = get_vector_of_rn_doubles(N_VALUES, 0, 1);

  for (int i = 0; i < N_VALUES; i++)
    big_A.insert(cols_a[i], cols_a[i]) = values_a[i];
  // big_A.makeCompressed(); // slows things down

  SparseMatrix<double> big_B(N_ROWS, N_COLS);
  std::vector<long> cols_b = gen_random_sample(0, N_COLS, N_VALUES);
  std::vector<long> rows_b = gen_random_sample(0, N_COLS, N_VALUES);
  std::vector<double> values_b = get_vector_of_rn_doubles(N_VALUES, 0, 1);

  for (int i = 0; i < N_VALUES; i++)
    big_B.insert(cols_b[i], cols_b[i]) = values_b[i];
  // big_B.makeCompressed();

  SparseMatrix<double> big_AB(N_ROWS, N_COLS);

  clock_t begin = clock();

  big_AB = (big_A * big_B); //.pruned();

  clock_t end = clock();
  double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
  std::cout << "Time taken : " << elapsed_secs << std::endl;

}

std::vector<long> gen_random_sample(long min, long max, long sample_size)
{
  std::vector<long> my_vector(sample_size); // THE BUG: should be `std::vector<long> my_vector;` (no size), otherwise the sample_size pre-filled zeros end up in the shuffled sample and produce duplicate indices

  for (long i = min; i != max; i++)
    {
      my_vector.push_back(i);
    }

  std::random_shuffle(my_vector.begin(), my_vector.end());

  std::vector<long> new_vec = std::vector<long>(my_vector.begin(), my_vector.begin() + sample_size);

    return new_vec;
}

double get_random_double(double min, double max)
{
   // Note: a default-seeded engine is constructed on each call, so every
   // call returns the same value.
   std::uniform_real_distribution<double> unif(min, max);
   std::default_random_engine re;
   return unif(re);
}

std::vector<double> get_vector_of_rn_doubles(int length, double min, double max)
{
  std::vector<double> my_vector(length);
  for (int i=0; i < length; i++)
    {
      my_vector[i] = get_random_double(min, max);
    }
  return my_vector;
}

I compiled with: `g++ -std=c++11 -I/usr/include/eigen3 time_eigen.cpp -o my_exec -O2 -DNDEBUG`.

Am I missing a way to do fast sparse multiplication with Eigen?

Akavall
    You should probably profile your C++ program. – Shoe Aug 02 '14 at 22:53
  • @Jefffrey, but the part I am timing is just the multiplication part, not my setup. – Akavall Aug 02 '14 at 22:56
  • @Akavall try also to compile with `-fopenmp` flag, to enable OpenMP parallelization in Eigen – vsoftco Aug 02 '14 at 22:58
    @vsoftco, I tried it; the time is about the same. But thanks anyway. – Akavall Aug 02 '14 at 23:00
  • @Akavall it is quite strange... for dense matrices I get timing very close to `MATLAB`, like 10% slower (MATLAB btw is super optimized with Intel MKL). Are you sure that `python` is not doing some kind of lazy evaluation? You are talking here about 3 orders of magnitude slower. Your code runs in ~13 seconds on my MacBook Pro 2013. – vsoftco Aug 02 '14 at 23:03
  • And note that `clang++` ignores the `-fopenmp` flag, so you may want to compile with `g++` to enable multi-processing. But yes, it doesn't seem to make a difference for matrix multiplication. – vsoftco Aug 02 '14 at 23:06
  • @vsoftco, This looks very strange to me too. I don't think Python is doing lazy evaluation; I can do `AB.sum()` instantly, which means that all values in `AB` are calculated. My `C++` code runs in ~13 seconds on your machine, right? – Akavall Aug 02 '14 at 23:14
  • @Akavall, yes, the `C++` code runs in `13s`, and I get the same `~0.03s` for Python. – vsoftco Aug 02 '14 at 23:14

2 Answers


If you compile without `-DNDEBUG`, you will see that your matrices are corrupted: you are inserting the same element multiple times, and the `insert` method does not allow this.

Replace the `insert` calls with `coeffRef(i,j) += value`, or build the matrices from a triplet list as recommended in the documentation. After this small fix, it takes 0.012s for the C++ code and 0.021s with Python on my computer. Note that you cannot truly deduce which one is faster from these two numbers, as the input matrices are not exactly the same, but at least they are of the same order.
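
For reference, here is a minimal, self-contained sketch of the triplet-list construction the answer refers to (the helper name and signature are illustrative, not code from the question):

#include <vector>
#include <Eigen/Sparse>

// Build a sparse matrix from (row, col, value) triplets instead of insert().
// setFromTriplets() sorts the entries and sums duplicates, so repeated
// (row, col) pairs are harmless here.
Eigen::SparseMatrix<double> build_from_triplets(const std::vector<long> &rows,
                                                const std::vector<long> &cols,
                                                const std::vector<double> &values,
                                                long n_rows, long n_cols)
{
  std::vector<Eigen::Triplet<double>> triplets;
  triplets.reserve(values.size());
  for (std::size_t i = 0; i < values.size(); ++i)
    triplets.emplace_back(rows[i], cols[i], values[i]);

  Eigen::SparseMatrix<double> m(n_rows, n_cols);
  m.setFromTriplets(triplets.begin(), triplets.end());
  return m;  // already in compressed form
}

With this, `big_A` and `big_B` can each be built in one call, and the multiplication itself stays unchanged.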

ggael
  • You are right, I had a stupid bug; when I fixed it, `Eigen` is faster than `scipy.sparse`. Thank you very much. – Akavall Aug 05 '14 at 02:47
  • Could someone please point out exactly what to replace? I am very new to this (shy). – Ludi Oct 20 '15 at 16:59
  • In Akavall's example, you would replace `big_A.insert` by `big_A.coeffRef`, and the same for `big_B`; see the sketch below. – ggael Oct 22 '15 at 19:54
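
For completeness, a minimal sketch of that substitution applied to the first fill loop from the question (the names `big_A`, `cols_a`, `values_a`, and `N_VALUES` are taken from the original code; the indices are kept exactly as posted):

  for (int i = 0; i < N_VALUES; i++)
    big_A.coeffRef(cols_a[i], cols_a[i]) += values_a[i]; // += accumulates duplicates instead of asserting
  // ...and the same for big_B with cols_b / values_b.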

Compare the times needed to dump the results. If Python is doing lazy evaluation (i.e. computing on the fly when the result is accessed, which may be sensible for sparse matrices), you will get the time difference back.

Thinkeye