
I have a document-term matrix of ~1600 documents x ~120 words. I would like to compute the cosine similarity between all these vectors, but that amounts to ~1,300,000 comparisons [n * (n - 1) / 2].

I used parallel::mclapply with 8 cores, but it still takes forever.

What other solution would you suggest?

Thanks

Bakaburg
  • I'd suggest doing this in a compiled language. Look into Rcpp. – Roland Jul 28 '17 at 16:02
  • What's the form of a cosine similarity between words? Do you have a sparse matrix? – F. Privé Jul 28 '17 at 17:13
  • If I understand what you mean: a document-term matrix is a document-by-word matrix where each value is the weight (there are many weights to choose from) of the word for that document. Yes, it's very sparse. BTW, I found a problem in how the tm package created the document term matrix, as I showed here https://stackoverflow.com/questions/31932387/documenttermmatrix-return-0-terms-in-tm-package, but this problem doesn't change the above question. – Bakaburg Jul 28 '17 at 20:13

3 Answers


Here's my take on it.

If I define cosine similarity as

coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}

(I think that is about as fast as I can make it with base R functions; the often overlooked crossprod is a little gem.) If I compare it with an Rcpp function using RcppArmadillo (slightly updated as suggested by @f-privé):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {
  arma::mat X(x.begin(), x.nrow(), x.ncol(), false);

  // Compute the crossprod
  arma::mat res = X.t() * X;
  int n = x.ncol();
  arma::vec diag(n);
  int i, j;

  // Extract the column norms from the diagonal of X'X
  for (i = 0; i < n; i++) {
    diag(i) = sqrt(res(i, i));
  }

  // Scale each entry by the product of the two column norms
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      res(i, j) /= diag(i) * diag(j);

  return wrap(res);
}

(This might be optimised with some of the specialized functions in the Armadillo library; I just wanted to get some timing measurements.)
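As a quick sanity check, the two implementations can be compared on random data. A minimal sketch, assuming the C++ function above has been compiled (e.g. with Rcpp::sourceCpp("cosine.cpp"), where the file name is just an assumption):

x <- matrix(rnorm(120 * 10), ncol = 10)
# both should return the same 10 x 10 similarity matrix
all.equal(coss(x), cosine_similarity(x))  # TRUE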

Comparing those yields

> x <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity2(x), times=50)
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval cld
               coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206    50  a
              coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934    50  b
 cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792    50  a

which is really not that bad. The gain from computing the cosine similarity in C++ is very small (with @f-privé's solution being the fastest), so I'm guessing your timing issues stem from converting the text from words to numbers, not from calculating the cosine similarity. Without knowing more about your specific code it is hard for us to help you.
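If that is the suspicion, timing the two stages separately will show where the time goes. A rough sketch, where corpus is a placeholder for your tm corpus:

library(tm)

# stage 1: text -> document-term matrix (often the real bottleneck)
system.time(dtm <- as.matrix(DocumentTermMatrix(corpus)))

# stage 2: the cosine similarity itself, with documents as columns
system.time(sims <- coss(t(dtm)))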

ekstroem
  • Many thanks, your base R implementation is actually very fast. I was using lsa::cosine, which is many orders of magnitude slower than your solution! – Bakaburg Jul 29 '17 at 21:24
  • I see from an error that `lsa::cosine` utilises `crossprod`. Any idea whether it did so 4 years ago, or whether the function has had a rewrite? – geotheory Aug 06 '21 at 10:33
  • OK, looks like that was implemented back in 2005 - https://github.com/cran/lsa/blob/master/R/cosine.R - so there is no indication this 2017 comparison isn't still valid. – geotheory Aug 06 '21 at 10:39

I very much agree with @ekstroem on the use of crossprod, but I think there are unnecessary computations in his implementation. I also think that the original coss (reproduced in the R chunk below) gives a wrong result. To compare his answer with mine, you can use this cpp file:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {

  arma::mat X(x.begin(), x.nrow(), x.ncol(), false);
  // squared norms of the columns (sums of squares over each column)
  arma::mat rowSums = sum(X % X, 0);
  arma::mat res;

  // crossprod scaled by the outer product of the column norms
  res = X.t() * X / sqrt(rowSums.t() * rowSums);

  return wrap(res);
}

// [[Rcpp::export]]
NumericMatrix& toCosine(NumericMatrix& mat,
                        const NumericVector& diag) {

  // divide each entry in place by the product of the two column norms
  int n = mat.nrow();
  int i, j;

  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      mat(i, j) /= diag(i) * diag(j);

  return mat;
}

/*** R
# @ekstroem's earlier coss: the denominator is not the product of the
# column norms, which is why it gives a wrong result (see above)
coss <- function(x) {
  crossprod(x)/(sqrt(crossprod(x^2)))
}

coss2 <- function(x) {
  cross <- crossprod(x)
  toCosine(cross, sqrt(diag(cross)))
}

XX <- matrix(rnorm(120*1600), ncol=1600)

microbenchmark::microbenchmark(
  cosine_similarity(XX), 
  coss(XX), 
  coss2(XX),
  times = 20
)
*/

Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 cosine_similarity(XX) 172.1943 176.4804 181.6294 181.6345 185.7542 199.0042    20
              coss(XX) 262.6167 270.9357 278.8999 274.4312 276.1176 337.0531    20
             coss2(XX) 134.6742 137.6013 147.3153 140.4783 146.5806 204.2115    20

So, I will definitely go for computing the crossprod in base R and then do the scaling in Rcpp.

PS: If you have a very sparse matrix, you could use the Matrix package to convert it to a sparse matrix. This new class of matrix also has a crossprod method, so you could use coss2 as well; see the sketch below.
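A minimal sketch of that route, where dtm is a placeholder for the ~1600 x ~120 document-term matrix from the question:

library(Matrix)

# documents as columns, stored sparsely
x <- Matrix(t(dtm), sparse = TRUE)

# sparse crossprod, then scale in Rcpp as in coss2
# (toCosine() expects a dense matrix, hence as.matrix())
cross <- as.matrix(crossprod(x))
sims <- toCosine(cross, sqrt(diag(cross)))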

F. Privé

The coop package's coop::cosine function is probably the best way to do this now. It is implemented in Rcpp, takes a different approach than lsa::cosine, and has lower memory overhead. Its use is exactly the same as lsa::cosine; just switch out the package names.
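For example, on random data of the size in the question:

library(coop)

x <- matrix(rnorm(120 * 1600), ncol = 1600)
sims <- cosine(x)  # 1600 x 1600 matrix of column-wise cosine similarities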

For further speedups, you may want to change your BLAS library. The coop manual has a few basic details and suggestions.
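To see whether there is room for that, sessionInfo() (R >= 3.4) reports which BLAS/LAPACK the session is linked against; actually switching libraries is platform-specific:

# which BLAS/LAPACK does this R session use?
si <- sessionInfo()
si$BLAS    # e.g. an OpenBLAS path if an optimised BLAS is already in use
si$LAPACK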

mlinegar