Here's my take on it.
If I define cosine similarity as
coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}
(I think that is about as quickly as I can make it with base R functions and the often overseen crossprod
which is a little gem). If I compare it with an RCpp function using RCppArmadillo (slightly updated as suggested by @f-privé)
NumericMatrix cosine_similarity(NumericMatrix x) {
arma::mat X(x.begin(), x.nrow(), x.ncol(), false);
// Compute the crossprod
arma::mat res = X.t() * X;
int n = x.ncol();
arma::vec diag(n);
int i, j;
for (i=0; i<n; i++) {
diag(i) = sqrt(res(i,i));
}
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
res(i, j) /= diag(i)*diag(j);
return(wrap(res));
}
(this might possibly be optimised with some of the specialized functions in the armadillo library - just wanted to get some timing measurements).
Comparing those yields
> XX <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(cosine_similarity(XX), coss(XX), coss2(XX), times=50)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity(x), cosine_similarity2(x), coss3(x), times=50)
Unit: milliseconds
expr min lq mean median uq max
coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206
coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934
cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792
neval cld
50 a
50 b
50 a
which is really not that bad. The gain in computing the cosine similarity using C++ is super small (with @ f-privé's solution being fastest) so I'm guessing your timing issues are due to what you are doing to convert the text from the words to numbers and not when calculating the cosine similarity. Without knowing more about your specific code it is hard for us to help you.