
Oracle claims that its GraalVM implementation of R (called "FastR") is up to 40x faster than normal R (https://www.graalvm.org/r/). However, I ran this super simple (but realistic) four-line test program, and not only was GraalVM/FastR not 40x faster, it was actually 10x SLOWER!

x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
  5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
t1 <- system.time(fit1 <- smooth.spline(x,y,spar=0.6))
t1

In FastR, t1 returns this value:

 user  system elapsed 
  0.870   0.012   0.901

While in normal (GNU) R, I get this result:

 user  system elapsed 
  0.112   0.000   0.113

As you can see, FastR is much slower even for this simple program (i.e., four lines of code, no extra/special libraries imported, etc.). I tested this on a 16-core VM on Google Cloud. Thoughts? (FYI: I took a quick peek at the smooth.spline code, and it does call Fortran, but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code.)

====================================

EDIT: Per the comments from Ben Bolker and user438383 below, I modified the code to include a for loop so that the code ran for much longer and I had time to monitor CPU usage. The modified code is below:

x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
  5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)

forloopfunction <- function(xTrain, yTrain) {
  for (x in 1:100) {
    smooth.spline(xTrain, yTrain, spar=0.6)
  }
}

t1 <- system.time(fit1 <- forloopfunction(x, y))
t1

Now, the normal R returns this for t1:

   user  system elapsed 
 19.665   0.008  19.667 

while FastR returns this:

 user  system elapsed 
 76.570   0.210  77.918 

So now FastR is only 4x slower, but that's still considerably slower. (I would be OK with a 5% to even 10% difference, but that's a 400% difference.) Moreover, I checked the CPU usage. Normal R used only 1 core (at 100%) for the entirety of the 19 seconds. However, surprisingly, FastR used between 100% and 300% of CPU (i.e., between 1 full core and 3 full cores) during the ~78 seconds.

So I think it is fair to conclude that, at least for this test (which happens to be a realistic test for my very simple scenario), FastR is at least 4x slower while consuming ~1x to 3x more CPU cores. Particularly given that I'm not importing any special libraries which the FastR team may not have had time to properly analyze (i.e., I'm using just vanilla R code that ships with R), I think there's something not quite right with the FastR implementation, at least when it comes to speed. (I haven't tested accuracy, but that's moot for now, I think.)

Has anyone else experienced anything similar, or does anyone know of any "magic" configuration that one needs to apply to FastR to get its claimed speeds (or at least similar, i.e., +-5%, speeds to normal R)? (Or maybe there's some known limitation to FastR that I may be able to work around, e.g., don't use the normal Fortran binaries, use these special ones instead, etc.)

Jonathan Sylvester
  • Unsure if this is a reliable benchmark. Did you try using `microbenchmark` as a more robust test? – user438383 Jan 22 '22 at 19:59
  • if the code uses parallelization and the problem is too small, the overhead of parallelizing may be greater than the benefits. – Ben Bolker Jan 22 '22 at 21:52
  • also, "up to 40x faster" and "10x slower" aren't actually incompatible ... :-) – Ben Bolker Jan 22 '22 at 21:52
  • @BenBolker and user438383, I added a section entitled "EDIT" to my original post that attempts to address your comments. – Jonathan Sylvester Jan 23 '22 at 00:59
  • I don't think I can answer this question, but it's interesting. I don't know how widely used this machinery actually is, which might limit the pool of people who can answer ... would digging in with more profiling help? https://medium.com/graalvm/where-has-all-my-run-time-gone-245f0ccde853 – Ben Bolker Jan 23 '22 at 16:54

1 Answer


TL;DR: your example is indeed not the best use case for FastR, because it spends most of its time in R builtins and Fortran code. There is no reason for it to be slower on FastR, though, and we will work on fixing that. FastR may still be useful for your application overall, or just for selected algorithms that run slowly on GNU-R but are a good fit for FastR (loopy, "scalar" code; see the FastRCluster package).


As others have mentioned, when it comes to microbenchmarks one needs to repeat the benchmark multiple times to allow the system to warm up. This is important in any case, but even more so for systems that rely on dynamic compilation, like FastR.
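For example, a minimal warm-up loop in plain R (not from the original answer; the `microbenchmark` package mentioned in the comments is a more robust alternative, and the repetition count of 10 here is arbitrary):

```r
# Time each repetition separately, so that the slower warm-up
# iterations can be distinguished from the later, peak ones.
# Assumes x and y from the question are already defined.
times <- sapply(1:10, function(i) {
  system.time(smooth.spline(x, y, spar = 0.6))["elapsed"]
})
print(times)
```

On a runtime with dynamic compilation, the later entries of `times` should be noticeably smaller than the first few.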

Dynamic just-in-time compilation works by first interpreting the program while recording a profile of the execution, i.e., learning how the program executes, and only then compiling the program using this knowledge to optimize it better(*). In the case of dynamic languages like R, this can be very beneficial, because we can observe types and other dynamic behavior that is hard, if not impossible, to determine statically without actually running the program.

It should now be clear why FastR needs a few iterations to show the best performance it can achieve. It is true that the interpretation mode of FastR has not been optimized very much, so the first few iterations are actually slower than GNU-R. This is not an inherent limitation of the technology that FastR is based on, but a tradeoff of where we put our resources. Our priority in FastR has been peak performance, i.e., performance after a sufficient warm-up, for microbenchmarks or for applications that run long enough.

To your concrete example: I could also reproduce the issue, and I analyzed it by running the program with the built-in CPU sampler:

$GRAALVM_HOME/bin/Rscript --cpusampler --cpusampler.Delay=20000 --engine.TraceCompilation example.R
...
-----------------------------------------------------------------------------------------------------------
Thread[main,5,main]
 Name                    ||             Total Time    ||              Self Time    || Location             
-----------------------------------------------------------------------------------------------------------
 order                   ||             2190ms  81.4% ||             2190ms  81.4% || order.r~1-42:0-1567
 which                   ||               70ms   2.6% ||               70ms   2.6% || which.r~1-6:0-194
 ifelse                  ||              140ms   5.2% ||               70ms   2.6% || ifelse.r~1-34:0-1109
...

  • --cpusampler.Delay=20000 delays the start of sampling by 20 seconds
  • --engine.TraceCompilation prints basic info about the JIT compilation
  • when the program finishes, it prints the table from CPU sampler
  • (example.R runs the microbenchmark in a loop)

One observation is that the Fortran routine called from smooth.spline is not to blame here. That makes sense, because FastR runs the very same native Fortran code as GNU-R. FastR does have to convert the data to native memory, but that is probably a small cost compared to the computation itself. Also, the transition between native and R code is in general more expensive on FastR, but here it does not play a role.

So the problem here seems to be the builtin function `order`. In GNU-R, builtin functions are implemented in C; they basically do a big switch on the type of the input (integer/real/...) and then execute highly optimized C code that does the work on a plain C integer/double/... array. That is already the most efficient thing one can do, and FastR cannot beat it, but there is no reason for it not to be as fast. Indeed, it turns out that there is a performance bug in FastR, and the fix is on its way to master. Thank you for bringing it to our attention.
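As a rough sanity check (my addition, not part of the original answer), one can time `order` in isolation on a vector of the same size as in the question; run on both runtimes, this should reproduce the gap the sampler points at without involving smooth.spline at all:

```r
# Isolate the builtin the CPU sampler identified as the hot spot.
# 300000 elements matches the size used in the question.
x <- runif(300000)
system.time(for (i in 1:100) order(x))
```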

Other points raised:

but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code

YMMV. The concrete benchmark presented on our website spends a considerable amount of time in R code, so the overhead of the R<->native transition does not skew the result as much. The best results come from translating the Fortran code to R, making the whole thing a pure R program. This shows that FastR can run the same algorithm in R as fast as, or quite close to, Fortran, and that is, performance-wise, the main benefit of FastR. There is no free lunch: warm-up time and the cost of the R<->native transition are currently the price to pay.

FastR used between 100% and 300% of CPU usage

This is due to JIT compilation running on background threads. Again, no free lunch.

To summarize:

  • FastR can run R code faster by using dynamic just-in-time compilation and optimizing chunks of R code (functions, or possibly multiple functions inlined into one compilation unit) to the point that it can get close to, or even match, equivalent native code, i.e., significantly faster than GNU-R. This matters for "scalar" R code, i.e., code with loops. Code that spends the majority of its time in builtin R functions, like, e.g., sum((x - mean(x))^2) for a large x, doesn't gain much, because it already spends most of its time in optimized native code even on GNU-R.
  • What FastR cannot do is beat GNU-R on the execution of a single R builtin function, which is likely already highly optimized C code in GNU-R. For individual builtins we may beat GNU-R, because we happen to choose a slightly better algorithm or GNU-R has a performance bug somewhere, or it can be the other way around, like in this case.
  • What FastR also cannot do is speed up native code, like the Fortran routines that some R code may call. FastR runs the very same native code. On top of that, the transition between native and R code is more costly in FastR, so programs that make this transition too often may end up slower on FastR.
  • Note: what FastR can do, and what is a work in progress, is to run LLVM bitcode instead of the native code. GraalVM supports execution of LLVM bitcode and can optimize it together with other languages, which removes the cost of the R<->native transition and gives the compiler even more power to optimize across this boundary.
  • Note: you can use FastR via the cluster package interface to execute only parts of your application.
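A minimal sketch of what that could look like, assuming the `fastRCluster` package provides a `makeFastRCluster()` helper that plugs into R's standard `parallel` cluster API (function names here are assumptions; consult the package documentation for the actual interface):

```r
# Sketch only: makeFastRCluster() is assumed from the fastRCluster
# package; everything else is the standard parallel cluster API.
library(parallel)

cl <- fastRCluster::makeFastRCluster()   # start a FastR process as a cluster node
res <- parSapply(cl, 1:10, function(i) { # this closure runs inside FastR
  i * 2
})
stopCluster(cl)
```

The idea is that only the function shipped to the cluster node runs on FastR, so a hot, loop-heavy algorithm can benefit from JIT compilation while the rest of the application stays on GNU-R.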

(*) the first profiling tier may be also compiled, which gives different tradeoffs

Steves
  • Thank you! So, once the `order` function bug is fixed, then FastR should be within 5% (or so) of GNU-R for the above program (after warm-up)? When will the fix for the order function be available (either in beta form or production)? Also, if the FastR function is called from my Java code millions of times (which is why I'm considering using FastR instead of, say, RServe) but it's called from different parts of my code (i.e., it's not just a simple for loop), that should still cause the JIT compiler to eventually kick in for FastR, right? If so, roughly when does it kick in? After the 1000th call? – Jonathan Sylvester Jan 25 '22 at 14:53
  • Yes, if you call the same function from different callsites, it will still be JIT compiled. The default threshold for first-tier compilation (faster, but less optimized) is 100 invocations, and for the second tier it is 1000. You can change that with `--engine.FirstTierCompilationThreshold` and `--engine.LastTierCompilationThreshold`, see `--help:expert`. – Steves Jan 25 '22 at 16:55
  • Re: "then FastR should be within 5%": it should get closer than what you're seeing. I can't give you a concrete number yet, because it is still in progress. We publish nightly dev builds. Once it's out, I'll update the answer or add a comment. – Steves Jan 25 '22 at 16:58
  • any update, even if it's a general timeline? – Jonathan Sylvester Apr 27 '22 at 16:00
  • In 22.1 we improved/fixed the performance of the `order` and `rank` builtin functions, which are used internally by `smooth.spline`, so the example should give better results. But there is still one other performance issue we know of and plan to address before we match/outperform GNU-R. – Steves May 02 '22 at 07:33