Fast cosine similarity of two sparse matrices in Rcpp with Armadillo

Question

I'm trying to port a very fast R function for calculating cosine similarity into Rcpp with Armadillo and sparse matrix operations.

Here's the R function:

#' Compute cosine similarities between columns in x and y
#' 
#' @description adapted from qlcMatrix::cosSparse
#'
#' @param x dgCMatrix with samples as columns
#' @param y dgCMatrix with samples as columns
#' @return dgCMatrix of cosine similarities pairs of columns in "x" and "y"
sparse.cos <- function(x, y) {
  s <- rep(1, nrow(x))
  nx <- Matrix::Diagonal(x = drop(Matrix::crossprod(x ^ 2, s)) ^ -0.5)
  x <- x %*% nx
  ny <- Matrix::Diagonal(x = drop(Matrix::crossprod(y ^ 2, s)) ^ -0.5)
  y <- y %*% ny
  return(Matrix::crossprod(x, y))
}

Here's an example usage of the R function:

library(Matrix)
m1 <- rsparsematrix(1000, 10000, density = 0.1)
m2 <- rsparsematrix(1000, 100, density = 0.2)
res <- c_sparse_cos_mat_mat(m1, m2)

And here's my best stab so far at an Rcpp function (not working):

//[[Rcpp::export]]
arma::SpMat<double> sparse_cos(const arma::SpMat<double> &x, const arma::SpMat<double> &y){

    arma::vec s(x.n_rows);
    s = s.fill(1);
    arma::vec nx = arma::vec(1 / sqrt(square(x) * s));
    arma::vec ny = arma::vec(1 / sqrt(square(y) * s));
    
    // apply column-wise Euclidean norm to x and y
    for(sp_mat::const_iterator it_x = x.begin(); it_x != x.end(); it_x++)
      x.at(it_x.row(), it_x.col()) = *it_x * nx(it_x.col());

    for(sp_mat::const_iterator it_y = y.begin(); it_y != y.end(); it_y++)
      y.at(it_y.row(), it_y.col()) = *it_y * ny(it_y.col());
    
    // return cross-product of x and y as cosine distance    
    return(x * y);
}

Questions:

What is the fastest way to multiply all non-zero values in each column of SpMat x by corresponding values in a vector of length ncol(x)?
How do I fix the issues in the Rcpp function? Specifically: lvalue required as left operand of assignment in line x.at(it_x.row(), it_x.col()) = *it_x * nx(it_x.col());.
The result is inherently dense, and ideally would be returned as a dense matrix. Is there a fast method for taking the cross-product of two sparse matrices while explicitly filling in a dense matrix with the result?

Maybe a dedicated sparse matrix library can help with 1. and 3. -- see e.g. http://jefftrull.github.io/c++/eigen/csparse/suitesparse/2017/02/09/sparse-matrices-for-cplusplus.html? RcppEigen and RcppArmadillo both have some functionality but I don't have a handy 'map' of "who has what". As for 2., I found that unbundling compound statements can be of help to the compiler. Try adding a temporary variable or two. — Dirk Eddelbuettel, Dec 19 '20 at 14:10
@DirkEddelbuettel Thanks for the guidance! csparse looks promising, and for #2 I'll see how much I can decompose the issue. I just dislike having the Rstudio session crash at every failure. — zdebruine, Dec 19 '20 at 14:13
It's a complicated and important and underserved area. I recall from (well over a decade ago !!) how _e.g._ the `lme4` developer were struggling to get consistent support (and at the `RcppEigen` was the best bet). These days `RcppArmadillo` has more support too---but I can never remember whether you do or do not need CSparse or SuiteSparse or just SuperLU along with it. On well equipped machines (such as Ubuntu) it seems to "just work", I am less clear about others. In general it is still a bit frustrating and underdocumented... — Dirk Eddelbuettel, Dec 19 '20 at 14:23
We do have a number of articles at the Rcpp Gallery though---enter 'sparse' in the search box. Just keep in mind that they were written over a period of several years during which some things changed (_i.e._ improved). Diffs welcome too :) — Dirk Eddelbuettel, Dec 19 '20 at 14:25
And re 2. if it is a crash rather than a compiler error (I may have misread it) then you may have a logic bug. Even blunt `print` statements on indices can help... — Dirk Eddelbuettel, Dec 19 '20 at 14:26

Fast cosine similarity of two sparse matrices in Rcpp with Armadillo

0 Answers0