
As the question title says, I'd like to know why the byte compiled R code (using compiler::cmpfun) is faster than equivalent Rcpp code for the following mathematical function:

func1 <- function(alpha, tau, rho, phi) {
     abs((alpha + 1)^(tau) * phi - rho * (1- (1 + alpha)^(tau))/(1 - (1 + alpha)))
}

Since this is a simple numerical operation, I would have expected Rcpp (funcCpp and funcCpp2) to be much faster than the byte-compiled R (func1c and func2c), especially since R would have more overhead either for storing (1 + alpha)^tau or for recomputing it. In fact, computing this exponent twice seems faster than the memory allocation in R (func1c vs func2c), which seems especially counterintuitive since n is large. My other guess is that maybe compiler::cmpfun is pulling off some magic, but I'd like to know if that is indeed the case.

So really, the two things I'd like to know are:

  1. Why are funcCpp and funcCpp2 slower than func1c and func2c? (Rcpp slower than compiled R functions)

  2. Why is funcCpp slower than func2? (Rcpp code slower than pure R)

FWIW, here's my C++ and R version data

user% g++ --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.0 (clang-700.0.72)
Target: x86_64-apple-darwin14.3.0
Thread model: posix

user% R --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin14.5.0 (64-bit)

And here's the R and Rcpp code:

library(Rcpp)
library(rbenchmark)

func1 <- function(alpha, tau, rho, phi) {
    abs((1 + alpha)^(tau) * phi - rho * (1- (1 + alpha)^(tau))/(1 - (1 + alpha)))
}

func2 <- function(alpha, tau, rho, phi) {
    pval <- (alpha + 1)^(tau)
    abs( pval * phi - rho * (1- pval)/(1 - (1 + alpha)))
}

func1c <- compiler::cmpfun(func1)
func2c <- compiler::cmpfun(func2)

func3c <- Rcpp::cppFunction('
    double funcCpp(double alpha, int tau, double rho, double phi) {
        double pow_val = std::exp(tau * std::log(alpha + 1.0));
        double pAg = rho/alpha;
        return std::abs(pow_val * (phi -  pAg) + pAg);
    }')

func4c <- Rcpp::cppFunction('
    double funcCpp2(double alpha, int tau, double rho, double phi) {
        double pow_val = pow(alpha + 1.0, tau) ;
        double pAg = rho/alpha;
        return std::abs(pow_val * (phi -  pAg) + pAg);
    }')

res <- benchmark(
           func1(0.01, 200, 100, 1000000),
           func1c(0.01, 200, 100, 1000000),
           func2(0.01, 200, 100, 1000000),
           func2c(0.01, 200, 100, 1000000),
           func3c(0.01, 200, 100, 1000000),
           func4c(0.01, 200, 100, 1000000),
           funcCpp(0.01, 200, 100, 1000000),
           funcCpp2(0.01, 200, 100, 1000000),
           replications = 100000,
           order='relative',
           columns=c("test", "replications", "elapsed", "relative"))

And here's the output of rbenchmark:

                             test replications elapsed relative
   func1c(0.01, 200, 100, 1e+06)       100000   0.349    1.000
   func2c(0.01, 200, 100, 1e+06)       100000   0.372    1.066
 funcCpp2(0.01, 200, 100, 1e+06)       100000   0.483    1.384
   func4c(0.01, 200, 100, 1e+06)       100000   0.509    1.458
    func2(0.01, 200, 100, 1e+06)       100000   0.510    1.461
  funcCpp(0.01, 200, 100, 1e+06)       100000   0.524    1.501
   func3c(0.01, 200, 100, 1e+06)       100000   0.546    1.564
    func1(0.01, 200, 100, 1e+06)       100000   0.549    1.573
  • 7
    That `func1c` is faster than `func2c` is likely due to the garbage collector or other hard-to-determine causes. Run your `benchmark` call several times and you'll see those two functions switch places in the ranking. Trying to measure something that takes only 1 microsecond to run is very difficult. – Joshua Ulrich Oct 15 '15 at 23:50
  • 2
    @bunk “Is it really that surprising C functions are faster than C++?” — Yes, absolutely. There’s no reason to expect this. In fact, C++ code can often be made to be *faster* than C code of the same abstraction level, and should never be slower. Here’s more info: http://programmers.stackexchange.com/a/29136/2366 – Konrad Rudolph Oct 16 '15 at 08:32
  • @bunk True, it’s twelve vs a dozen. There should be no difference either way. The “magic” that Rcpp spins may of course incur an overhead — however, is `UN`/`PROTECT` even called here? My knowledge of R’s C API is sketchy at best but since these are all arguments there should be no protection necessary, should there? – Konrad Rudolph Oct 16 '15 at 09:11
  • @bunk: `PROTECT` is only necessary if it's possible that the garbage collector runs while you're still using the object. See [section 5.9.1 in Writing R Extensions](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Garbage-Collection). – Joshua Ulrich Oct 16 '15 at 13:37
  • @bunk: I was only referring to your function, where you only call `allocVector` once. You don't need the `res` variable, since you can wrap your `fabs` call in `ScalarReal`: `return ScalarReal(fabs(...));`. – Joshua Ulrich Oct 16 '15 at 14:11
  • @JoshuaUlrich oh man that's awesome! I'm just learning the API and was wondering why I had to always be allocating results – Rorschach Oct 16 '15 at 14:13
  • @JoshuaUlrich The overhead from the garbage collector makes sense. Thanks for that explanation! And looking at the disassemble output, func1c and func2c differ by only about 8 or so op codes, so my original assumption that there would be significant overhead with 1c was completely wrong. Moral: Luke Tierney did a _really_ good job with the byte compiler. – lostinarandomforest Oct 16 '15 at 15:49
  • @JoshuaUlrich I finally got around to looking at the macro, and it just does the allocVector, creating a res variable anyways. It doesn't use UN/PROTECT either. – Rorschach Oct 23 '15 at 21:22
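
As a side note to the last few comments (not part of the original thread), the byte code they refer to can be inspected with compiler::disassemble. A minimal sketch, using the functions defined above; the exact op codes will differ across R versions:

library(compiler)
disassemble(func1c)   # symbolic byte code for func1c
disassemble(func2c)   # compare with func2c; per the comment above, the two differ by only a handful of op codes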

1 Answer


This is essentially an ill-posed question. When you posit

func1 <- function(alpha, tau, rho, phi) {
     abs((alpha + 1)^(tau) * phi - rho * (1- (1 + alpha)^(tau))/(1 - (1 + alpha)))
}

without even specifying what the arguments are (i.e. scalar or vector? big or small? what memory overhead?), then in the best case you may just get a small set of (base, efficient) function calls directly from the parsed expression.

And ever since we have had the byte compiler, which Luke Tierney has further improved in subsequent R releases, we have known that it handles algebraic expressions like this well.

Now, compiled C/C++ code does that well too -- but there will be overhead in calling the compiled code, and what you see here is that for "trivial enough" problems the overhead does not really get amortized.

So you end up with pretty much a draw. No surprise as far as I can tell.

Dirk Eddelbuettel
  • “but there will be overhead in calling the compiled code” — I think this is the crux of the problem. How much overhead is there? Because calling a conventional R function of course also carries an overhead, and *all* the individual mathematical operations in the above R function incur this overhead. So it should, in sum, be *much* bigger than the single overhead of calling the C++-compiled function once. – Konrad Rudolph Oct 16 '15 at 14:29
  • Why don't you measure it? Start with `Rcpp::cppFunction('void empty() {} ')` and go from there... – Dirk Eddelbuettel Oct 16 '15 at 14:50
  • That’s not the point. If you say this, I believe you. I’d like to understand *why*, and measuring doesn’t help me there. – Konrad Rudolph Oct 16 '15 at 15:05
  • 2
    You are reading too much into this. A native function calls a primitive. A function generated by Rcpp and Rcpp Attributes will have one or two _very skinny_ wrappers and indirection. These do not matter in non-pathological examples, but this question is pathological. – Dirk Eddelbuettel Oct 16 '15 at 15:11
  • Ah, that was my misunderstanding. Thanks. So the wrapper is essentially the same that user “bunk” was discussing in the comments under the question (now deleted). Of course it makes sense that this incurs a penalty (and also that that penalty is usually negligible). – Konrad Rudolph Oct 16 '15 at 15:21
  • Yup and yup (though I never saw what @bunk wrote). – Dirk Eddelbuettel Oct 16 '15 at 15:31
  • @DirkEddelbuettel So if I understand you correctly, you mean that the byte compiler is compiling to what it expects is the most efficient base case. And if I tested it with say vectors instead of scalars, I might see different performance because the byte compiler would not have optimized for that. On the other hand, it's interesting that the overhead in the C++ call was substantial enough to drag performance down that much, but it might be a case of the problem of benchmarking microsecond-speed code. – lostinarandomforest Oct 16 '15 at 15:58
  • 1
    No, that is not really what I was saying. – Dirk Eddelbuettel Oct 16 '15 at 19:40
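
As a rough way to put a number on the per-call overhead discussed above, one can follow Dirk's suggestion in the comments and time an empty Rcpp-generated function against an empty byte-compiled R function. This is a minimal sketch (function names are illustrative; timings will vary by machine):

library(Rcpp)
library(rbenchmark)

emptyCpp <- Rcpp::cppFunction('void emptyCpp() {}')   # does nothing, measures only call cost
emptyR   <- compiler::cmpfun(function() NULL)         # byte-compiled no-op for comparison

benchmark(emptyCpp(), emptyR(),
          replications = 100000,
          order = "relative",
          columns = c("test", "replications", "elapsed", "relative"))

The difference in elapsed time estimates the fixed wrapper and indirection cost of an Rcpp call; for a function whose actual work takes on the order of a microsecond, that is the kind of overhead the answer says does not get amortized.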