1

I recently wrote a compute intense function in Rcpp. Now, I would like to port this code to an R package. However, I notice that the code is much (~100x) slower when run inside an R package.

I already read here, that this may have to do with how the function is called. However, this is not a one-time cost. Instead, it scaled with the number of iterations in the Rcpp function (only a single call to Rcpp is made).

Please find a minimally complete verifiable example below. The function below doesn't do anything useful but shows the behavior I am worried about.

How can I troubleshoot this issue?

Steps to recreate package.

  • Use Rcpp.package.skeleton to create a new package skeleton with Rcpp.

  • Add the following example.cpp file to \src.

    example.cpp

    #include <Rcpp.h>
    
    // [[Rcpp::export]]
    int example_cpp(Rcpp::IntegerMatrix mat, int iters) {
      for(int i = 0; i < iters; ++i) {
        std::vector<int> vec;
        std::iota(std::begin(vec), std::end(vec), 0);
      }
      return 0;
    }
    
  • Add the following example.R file to \R.

    example.R

    # @export
    example <- function(mat, iters) {
      example_cpp(mat, iters)
    }
    
  • Test the Rcpp function inside/outside the package using the following script.

    library(examplePackage)
    
    Rcpp::sourceCpp('src/example.cpp')
    exampleOutside <- function(mat, iters) {
      example_cpp(mat, iters)
    }
    
    set.seed(42)
    mat <- replicate(n=1000, sample(1:10))
    
    for(iters in c(1e4, 1e5, 1e6)) {
      res <- microbenchmark::microbenchmark(
        example(mat, iters),
        exampleOutside(mat, iters),
        times=10
      )
      print(iters)
      print(res)
    }
    

Output.

[1] 10000
Unit: microseconds
                       expr     min      lq     mean  median      uq      max neval
        example(mat, iters) 629.550 630.977 696.1131 686.488 719.399  858.081    10
 exampleOutside(mat, iters)   3.143   4.203 239.7205   5.021   6.981 2340.719    10
[1] 1e+05
Unit: microseconds
                       expr      min       lq      mean    median       uq      max neval
        example(mat, iters) 6512.453 6625.420 6717.6595 6713.2375 6843.519 6921.158    10
 exampleOutside(mat, iters)    2.637    3.226    7.6473    4.1205   12.647   16.489    10
[1] 1e+06
Unit: microseconds
                       expr       min        lq       mean    median        uq       max neval
        example(mat, iters) 64091.144 66392.745 67491.8759 68001.405 68609.006 69028.736    10
 exampleOutside(mat, iters)     2.885     3.574    10.6664     4.792    17.653    35.927    10
Dirk Eddelbuettel
  • 360,940
  • 56
  • 644
  • 725
s1624210
  • 627
  • 4
  • 11
  • 4
    It's even more difficult for us to know which information may be relevant here. I think sharing the code of the C++ function, and examples of how it is called in R (inside and outside of your package) would be necessary to have any hope of helping you though. – Allan Cameron Dec 05 '22 at 12:35
  • 1
    Rcpp can be blazingly fast, but things that can slow it down are numerous and generally outside of its control. For instance, calling regular R functions (requiring repeated back-and-forth between your Rcpp function and the R interpreter) can be a problem. I don't have enough experience with it to know how being called from within a package would change its behavior. If Dirk weighs in (he often responds to [tag:rcpp]), I suspect that while he may have insight about these package boundaries, he'll also chastise asking a code-centric question in the absence of code. – r2evans Dec 05 '22 at 12:58
  • The question is unanswerable as it is. Either edit it to give us reproducible code, or delete it. – Dirk Eddelbuettel Dec 05 '22 at 13:22
  • 1
    StackOverflow recommends [minimally complete verifiable examples](https://stackoverflow.com/help/minimal-reproducible-example). All four terms matter. Minimal is one of them. Sadly I will not have the time to dig through one-hundred-ish lines of yours. R has wonderful facillities for profiling, they may help you spot bottlenecks in your code you can examine then in detail. Good luck! – Dirk Eddelbuettel Dec 05 '22 at 14:29
  • @DirkEddelbuettel Please find a MCVE above! – s1624210 Dec 05 '22 at 15:40
  • @AllanCameron The question is edited accordingly. – s1624210 Dec 05 '22 at 15:50

1 Answers1

1

Edit: See below for the full discussion. You are measuring noise as the functions are empty returning immediately. For a real problem I replace your function and it unused argument and data with a simple summation of the log of the loop value:

#include <Rcpp.h>

// [[Rcpp::export]]
double nontrivial(int iters) {
    double sum = 0;
    for (int i = 0; i < iters; ++i) {
        sum += log(i * 1.0);
    }
    return sum;
}

Running over 10 million elements the default one hundred takes just under five seconds, and post approaches are equally fast:

edd@rob:~/git/stackoverflow/74688018(master)$ Rscript nontrivial.R
     test replications elapsed relative
1  inside          100   4.743    1.000
2 outside          100   4.920    1.037
edd@rob:~/git/stackoverflow/74688018(master)$ 

Updated files are in the repo linked below. The original reply from yesterday follows.


Thanks for making the problem smaller. There are numerous problems here:

  • you risk naming collission by calling the function you source the same as the one the package when you load the package, I would avoid that
  • the matrix creations
    • nonsensical: what is replicate doing there
    • pointless: the matrix is not used in the code
  • after I made some rearrangements I set up a simpler test (see below) and it made it pretty clear that neither function is doing anything as the compiler looked through your empty loop. You can return zero faster by just returning zero, what is happening here:

Simpler Test

Rcpp::sourceCpp("exampleOutside.cpp")

set.seed(42)
mat <- matrix(sample(1:10, 100, replace=TRUE), 10, 10)

rbenchmark::benchmark(examplePackage::example_cpp(mat, 10),
                      exampleOutside(mat, 10),
                      replications = 10)[,1:4]

Outcome

> rbenchmark::benchmark(examplePackage::example_cpp(mat, 10), 
+                       exampleOutside(mat, 10), replications = 10)[,1:4]
                                  test replications elapsed relative
2              exampleOutside(mat, 10)           10   0.001       NA
1 examplePackage::example_cpp(mat, 10)           10   0.000       NA
> 

Modified 'outside' function

#include <Rcpp.h>

// [[Rcpp::export]]
int exampleOutside(Rcpp::IntegerMatrix mat, int iters) {
  for(int i = 0; i < iters; ++i) {
    std::vector<int> vec;
    std::iota(std::begin(vec), std::end(vec), 0);
  }
  return 0;
}

Overall, this is a fairly good "teachable moment" about why minimal examples are recommended. You were chasing a mirage, and you didn't notice because you put up too many barriers for yourself to see that maybe your comparison setup was not quite right.

In short, "when something cannot be true" (as in 'massive difference between same function compiled two different ways') it often is not.

(And for reproducibility all my files are in this directory on github.)

Dirk Eddelbuettel
  • 360,940
  • 56
  • 644
  • 725
  • Thank you really much for bearing with me and taking the time to look at the issue. I cloned your repo. Unfortunately, the problem persists but only becomes visible for larger values of `iters`. I ran your script `modifiedScript.R` and changes `iters` in both function calls (in `benchmark`) to `1e4` and `replications` to `1e3`. In this case, I still consistently get a relative runtime of >> 100. Am I missing something? – s1624210 Dec 05 '22 at 20:06
  • I think you insist on measuring the cost of empty function calls and 'empty' (as in compiler-optimized loops). If and and when you actually have a valid example, please free to open an issue at the Rcpp repo, or a new issue here. So far you fail to convince me that there is an issue. As for your new values, I get 3ms elapsed 'inPkg' and 13ms elapsed 'outsidePkg' with a relative gain of 4.3 for code in the package. But that is just measuring noise. – Dirk Eddelbuettel Dec 05 '22 at 20:21
  • Per that last comment I just updated the question to measure a non-trivial, non-empty functions. Both approaches take about the same time, which is what you would expect. – Dirk Eddelbuettel Dec 06 '22 at 13:18
  • 1
    I suspected this was this case (i.e. the OP's example wasn't really doing anything). I recall your comment, "_compilers are smart these days_", on a similar question: https://stackoverflow.com/q/41409841/4408538 – Joseph Wood Dec 12 '22 at 17:15