0

I am implementing Latent Dirichlet Allocation (LDA) in Rcpp. In LDA, we need to deal with a huge sparse matrix (e.g. 50 x 3000).

I decided to use SparseMatrix in Eigen. However, since I need access to each cell, computationally expensive .coeffRef slows down my function a lot.

Is there any way to use SparseMatrix while keeping the speed?

What I want to do has four steps,

  1. I know which cell (i,j) I want to access.
  2. I want to know whether the cell (i,j) is 0 or not.
  3. If the cell (i,j) is not 0, I want to know its value.
  4. After doing some analysis with the value in step 2 and 3, I want to update the cell (i,j). In this step, I might need to update the cell (i,j) which originally has 0.
#include <iostream>
#include <Eigen/dense>
#include <Eigen/Sparse>
using namespace std;
using namespace Eigen;
typedef Eigen::Triplet<double> T;

int main(){

  Eigen::SparseMatrix<double> spmat;

  // Insert in spmat
  vector<T> tripletList;
  int value;

  tripletList.push_back(T(0,1,1));
  tripletList.push_back(T(0,3,2));
  tripletList.push_back(T(1,5,3));
  tripletList.push_back(T(2,4,4));
  tripletList.push_back(T(4,1,5));
  tripletList.push_back(T(4,5,6));
  spmat.resize(5,7);  // define size
  spmat.setFromTriplets(tripletList.begin(), tripletList.end());

  for(int i=0; i<5; i++){ // I am accessing all cells just to clarify I need to access cell
    for(int j=0; j<7; j++){
      // Check if (i,j) is 0  
      if(spmat.coeffRef(i,j) != 0){
        // Some analysis
        value = spmat.coeffRef(i,j)*2;  // just an example, more complex in the model
      }
      spmat.coeffRef(i,j) += value;  // update (i,j)
    }
  }

  cout << spmat << endl;

  return 0;
}

Since the number of rows is much smaller than the columns, I considered accessing a column and then check the row value, but I couldn't handle SparseMatrix<double>::InnerIterator it(spmat, colid).

user51966
  • 967
  • 3
  • 9
  • 21
  • What was wrong with using the `InnerIterator`? That seems to be exactly what you need... – Avi Ginsburg Jun 24 '19 at 07:06
  • 2
    Where is the connection to Rcpp(11)? – Ralf Stubner Jun 24 '19 at 08:13
  • 6
    1. 50x3000 is not "huge" by today's standard, it is barely above one MiB when using doubles (which easily fits into L3 cache of modern CPUs). 2. How sparse is your matrix in practice? You have (about) 4byte overhead per element when stored as sparse matrix (and a lot of computationally overhead). 3. Sparse matrices are ideally accessed iteratively, not by random-access. Consider using a different data structure. – chtz Jun 24 '19 at 08:55
  • @AviGinsburg I tried, but I want to do random access. – user51966 Jun 24 '19 at 18:58
  • @RalfStubner Since the usage of RcppEigen is the same in Cpp, I uploaded the Cpp code for simplicity. – user51966 Jun 24 '19 at 18:59
  • 1
    Your example shows sequential access. You can do that for the existing entries and then loop over the missing entries and add them when needed. However, you really should read chtz's comment at least another few times as you seem to be trying to solve the wrong problem. – Avi Ginsburg Jun 24 '19 at 19:01
  • @chtz Thanks! One of my colleagues told me the memory usage became 60GB with my current implementation when he tried it with massive data, and he advised me to use Sparse Matrix. Probably memory usage comes from a different issue. – user51966 Jun 24 '19 at 19:01
  • 2
    The 60GB memory usage definitely does not come from a single 50x3000 matrix. Unfortunately, I my crystal ball does not tell me the reason for your actual problem (and that would be a different question). – chtz Jun 25 '19 at 14:49

0 Answers0