
I am using Sparse Matrices in Eigen and I observe the following behavior:

I have the following Sparse Matrices with Column Major storage

  • A [1,766,548 x 3,079,008] with 105,808,194 non-zero elements and

  • B [3,079,008 x 1,766,548] with 9,476,108 non-zero elements

When I compute the product AxB, it takes almost 8 seconds.

When I compute transpose(B) x transpose(A) instead, the computational cost increases dramatically: it runs for about 2,500 seconds.

Note that I load the transposed matrices from files; I don't transpose them with Eigen.
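
For reference, here is a rough sketch of the two computations. The `loadMatrix` helper and the file names are placeholders for my own loading code, which is not relevant here:

```cpp
#include <Eigen/Sparse>
#include <chrono>
#include <iostream>
#include <string>

using SpMat = Eigen::SparseMatrix<double>;   // column-major by default

// Placeholder: the matrices are actually read from my own file format.
SpMat loadMatrix(const std::string& path);

int main() {
    SpMat A  = loadMatrix("A.bin");    // 1,766,548 x 3,079,008, ~105.8M non-zeros
    SpMat B  = loadMatrix("B.bin");    // 3,079,008 x 1,766,548, ~9.5M non-zeros
    SpMat Bt = loadMatrix("Bt.bin");   // transpose of B, loaded directly from file
    SpMat At = loadMatrix("At.bin");   // transpose of A, loaded directly from file

    auto t0 = std::chrono::steady_clock::now();
    SpMat AB = A * B;                  // finishes in ~8 seconds
    auto t1 = std::chrono::steady_clock::now();
    SpMat BtAt = Bt * At;              // runs for ~2,500 seconds
                                       // (as clarified in the comments, my actual
                                       //  code used (Bt*At).pruned() here)
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "A*B   : " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "Bt*At : " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}
```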

I didn't expect the two approaches to have exactly the same computational cost, but I don't understand such a large difference in execution time, since in both cases the two matrices contain exactly the same number of non-zero elements.

I am using g++ 7.4 and Eigen 3.3.7.

serafeim
  • I guess you are compiling with `-O3`? If that's already the case, look at the number of non-zeros in the results of both products to see whether there is a huge difference that could explain the difference. Maybe you are running out of memory and your system starts to swap? – ggael Dec 02 '19 at 20:07
  • Yes, compilation is done with -O3 and I still have plenty of RAM available. Shouldn't the non-zeros of the result be exactly the same, given that in the second scenario the multiplied matrices are the transposes of those in the first? I will double-check this to be sure... – serafeim Dec 02 '19 at 20:32
  • Note that in the "fast" scenario the matrix A, which contains more non-zero elements, is on the left side of the product, while in the "slow" scenario it is on the right side. I can't find which algorithm is used for the sparse matrix multiplication, but could this, combined with the column-major layout of the matrices, cause such a difference in performance? – serafeim Dec 02 '19 at 20:57
  • Sorry, I did not pay enough attention to your question. Indeed, for each column of `AxB`, the underlying algorithm accumulates very few but quite dense columns, whereas in the transposed case it accumulates numerous but very sparse columns. Accumulations are performed within a dense vector. Regarding cache-friendliness, the first case is thus definitely more favorable, but such a huge speed difference is still surprising. – ggael Dec 02 '19 at 23:02
  • You can try with a different algorithm by calling `(Bt*At).pruned()`. – ggael Dec 02 '19 at 23:03
  • Thank you @ggael! In fact, I was using the `pruned()` method. When I switched to the default product, the execution time for the second case dropped to ~28 seconds. That is still a speed difference, but not such a huge one. Are you aware of the two different algorithms used in these cases? I can't find the algorithms used for sparse matrix multiplication anywhere in the documentation. – serafeim Dec 03 '19 at 08:38
  • Any reason why you are not using `(A*B).transpose()`? – chtz Dec 03 '19 at 09:48
  • Ah, that is because in my application the two matrices can in general be totally different. It just happens in this scenario that they are the transposes of the first ones, and that is when I noticed this behavior. – serafeim Dec 03 '19 at 10:29
  • @serafeim I implemented both ;) They both compute one column of the result at a time by performing updates of the form `res.col(j) += A.col(k)*B(k,j)`, but they differ in how those sparse updates are handled. It looks like the pruned version wrongly assumes a fully dense result (see the sketch after the comments). – ggael Dec 03 '19 at 12:58
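
To make ggael's last comment concrete, below is a rough, self-contained sketch of the column-wise accumulation into a dense workspace that both product algorithms perform. The function and variable names are mine, and this is only an illustration of the idea, not Eigen's actual implementation:

```cpp
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <vector>

using SpMat = Eigen::SparseMatrix<double>;   // column-major

// Schematic sparse product: for each column j of the result, accumulate
// res.col(j) += A.col(k) * B(k,j) over the non-zeros B(k,j) of column j of B,
// using a dense column-sized workspace as the accumulator.
SpMat schematicSparseProduct(const SpMat& A, const SpMat& B) {
    SpMat res(A.rows(), B.cols());
    Eigen::VectorXd workspace = Eigen::VectorXd::Zero(A.rows());
    std::vector<char> hit(A.rows(), 0);      // which rows of the current column are non-zero
    std::vector<int>  touched;               // their indices, for flushing

    for (int j = 0; j < B.outerSize(); ++j) {
        // Accumulate the contributing columns of A (few and dense for A*B,
        // many and very sparse for Bt*At).
        for (SpMat::InnerIterator itB(B, j); itB; ++itB) {
            const int    k   = itB.index();
            const double bkj = itB.value();
            for (SpMat::InnerIterator itA(A, k); itA; ++itA) {
                const int i = itA.index();
                if (!hit[i]) { hit[i] = 1; touched.push_back(i); }
                workspace[i] += itA.value() * bkj;
            }
        }
        // Flush the dense accumulator into column j of the sparse result.
        for (int i : touched) {
            res.insert(i, j) = workspace[i];
            workspace[i] = 0.0;
            hit[i] = 0;
        }
        touched.clear();
    }
    res.makeCompressed();
    return res;
}
```

In the fast case (AxB) each result column accumulates only a few, fairly dense columns of A, so the inner loop streams long contiguous runs of the workspace; in the transposed case it accumulates numerous very sparse columns, which is far less cache-friendly, as described in the comments above.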

0 Answers