In the Eigen docs for filling a sparse matrix, it is recommended to use the triplet filling method, as it can be much more efficient than making calls to coeffRef, which involves a binary search.
For filling SparseVectors, however, there is no clear recommendation on how to do it efficiently.
The suggested method in this SO answer uses coeffRef, which means that a binary search is performed for every insertion.
Is there a recommended, efficient way to build sparse vectors? Should I try to create a single-row SparseMatrix and then store that as a SparseVector?
My use case is reading in LibSVM files, in which there can be millions of very sparse features and billions of data points. I'm currently representing these as an std::vector<Eigen::SparseVector>. Perhaps I should just use SparseMatrix instead?
Edit: One thing I've tried is this:
    // for every data point in a batch do the following:
    Eigen::SparseMatrix<float> features(1, num_features);
    // copy the data over
    typedef Eigen::Triplet<float> T;
    std::vector<T> tripletList;
    for (int j = 0; j < num_batch_instances; ++j) {
        // start each data point with an empty triplet list
        tripletList.clear();
        for (size_t i = batch.offset[j]; i < batch.offset[j + 1]; ++i) {
            uint32_t index = batch.index[i];
            float fvalue = batch.value[i];
            if (index < num_features) {
                tripletList.emplace_back(0, index, fvalue);
            }
        }
        // build the 1 x num_features matrix, then convert it to a vector
        features.setFromTriplets(tripletList.begin(), tripletList.end());
        samples->emplace_back(Eigen::SparseVector<float>(features));
    }
This creates a SparseMatrix using the triplet list approach, then creates a SparseVector from that object. In my experiments with ~1.4M features and very high sparsity, this is 2 orders of magnitude slower than using SparseVector and coeffRef, which I definitely did not expect.