0

I've been developing an R package for single cell RNA-seq analysis, and one of the functions I used repeatedly calculates the cosine dissimilarity matrix for a given matrix of m cells by n genes. The function I wrote is as follows:

CosineDist <- function(input = NULL) {
  if (is.null(input)) { stop("You forgot to provide an input matrix") }
  dist_mat <- as.dist(1 - input %*% t(input) / (sqrt(rowSums(input^2) %*% t(rowSums(input^2)))))
  return(dist_mat)
}

This code works fine for smaller datasets, but when I run it on anything over 20,000 rows it takes forever and then crashes my R session due to memory issues. I believe that porting this to Rcpp would make it both faster and more memory efficient (I know this is a bit of a naive belief, but my knowledge of C++ in general is limited). Finally, the output of the function, though it does not have to be a distance matrix object when returned, does need to be able to be converted to that format after its generation.

How should I got about converting this function to Rcpp and then calling it as I would any of the other functions in my package? Alternatively, is this the best way to go about solving the speed / memory problem?

πάντα ῥεῖ
  • 1
  • 13
  • 116
  • 190
jjack
  • 99
  • 7
  • How does it compare with the existing function `cosine()` in package `isa`? – dcarlson Sep 05 '20 at 23:14
  • 1
    This should be a straightforward function in Rcpp, but your question is essentially "can someone show me how Rcpp works and convert my function to C++?". That's really beyond the scope of what's possible or on-topic on Stack Overflow. Far better to read a [good free introductory Rcpp text](https://teuder.github.io/rcpp4everyone_en/) which includes a [chapter on matrices](https://teuder.github.io/rcpp4everyone_en/100_matrix.html) – Allan Cameron Sep 05 '20 at 23:18
  • I was not aware of that package or function, I'll give it a shot right now. – jjack Sep 05 '20 at 23:18
  • from checking the source code the cosine function from isa is also an pure R implementation – Steffen Moritz Sep 05 '20 at 23:20
  • I can't seem to find this `isa` package, though there is an `isa2` package, which does not have a cosine distance function. – jjack Sep 05 '20 at 23:22
  • In general you are right, using C or C++ should provide usually provide a extreme speedup (dependent on what you are doing). Had cases where it brought a 100x speedup ... – Steffen Moritz Sep 05 '20 at 23:22
  • https://cran.r-project.org/web/packages/lsa/ - you can download the source to view the implementation of the cosine function - the package "microbenchmark" is great for speed benchmarks - maybe you can compare it to your own implementation. Also for pure R implementations there can be huge speed difference (avoid loops as much as you can!) – Steffen Moritz Sep 05 '20 at 23:23
  • @AllanCameron that is a good point, I should've been more clear in that I was more hoping to determine if it was possible / actually worth doing since I'm not that familiar with the pros and cons of Rcpp. – jjack Sep 05 '20 at 23:23
  • Some related questions you may find interesting or useful: https://stackoverflow.com/q/45377700/8386140 and https://stackoverflow.com/q/44508366/8386140 – duckmayr Sep 05 '20 at 23:52
  • Unfortunately the `lsa::cosine()` function is also *very* slow. I'm reconsidering whether I'm using the correct distance metric for my data, so I'm going to try out several different options and compare speed / results. – jjack Sep 06 '20 at 00:02

1 Answers1

2

Hard to help you, since as the comments pointed out you are basically searching for an Rcpp intro.

I'll try to give you some hints, which I already mentioned partly in the comments.

In general using C/C++ can provide a great speedup (dependent on the task of course). But I've reached for (loop intensive, not optimized code) 100x+ speedups.

Since adding C++ can be complicated and sometimes cause problems, before you go this way check the following:

1. Is your R code optimized?

You can make lot of bad choices here (e.g. loops are slow in R). Just by optimizing your R code speedups of 10x or much more can often be easily reached.

2. Are there better implementations in other packages?

Especially if it is helper functions or common functionalities, often other packages have these already implemented. Benchmark different existing solutions with the 'microbenchmark' package. It is easier to just use an optimized function from another R package then doing everything on your own. (maybe the other package implementations are already in C++). I mostly try to look for mainstream and popular packages (since these are better tested and they are unlikely to suddenly drop from CRAN).

3. Profile your code

Take a look what parts exactly cause the speed / memory problems. Might be that you can keep parts in R and only create a function for the critical parts in C++. Or you find another package that has a R function that is implemented in C for exactly this critical part.

In the end I'd say, I prefer using Rcpp/C++ over C code. Think this is the easier way to go. For the Rcpp learning part you have to go with a dedicated tutorial (and not a SO question).

Steffen Moritz
  • 7,277
  • 11
  • 36
  • 55