Improve performance when calling an R package

Question

We have created an R package that should do near-real-time scoring through OpenCPU. The problem is a very large overhead when calling the package. The R code itself executes quickly, so the overhead occurs before and after R is initialized.

The R package contains two model objects (100 MB and 40 MB). The poor performance appears to be related to the size of the model objects, because performance improves when the objects are smaller.

We have added the package to `preload` in server.conf, added `onLoad <- function(lib, pkg)`, and set lazyload = FALSE.

We have also tried saving the data in inst/extdata and then loading it with `readRDS(system.file())`.

With both approaches we expected the models to be cached in memory the first time the package is loaded and then kept there, so that no reload is done. That does not seem to work, or at least there is some overhead on every curl request.
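For reference, the load-once behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions: the cache environment `.model_cache`, the helper `load_model_once()`, and the file name `model_a.rds` are hypothetical names, not taken from our package, and the sketch does not show whether the pattern removes the overhead we see through OpenCPU.

```r
# Package-level cache; in a real package this would live in e.g. R/zzz.R.
# (.model_cache is a hypothetical name.)
.model_cache <- new.env(parent = emptyenv())

# Hypothetical helper: read the object from `path` only if it is not cached yet.
load_model_once <- function(name, path) {
  if (!exists(name, envir = .model_cache, inherits = FALSE)) {
    assign(name, readRDS(path), envir = .model_cache)
  }
  get(name, envir = .model_cache, inherits = FALSE)
}

# In the package, .onLoad would fill the cache from inst/extdata.
# "model_a.rds" is a placeholder file name.
.onLoad <- function(libname, pkgname) {
  path <- system.file("extdata", "model_a.rds", package = pkgname)
  if (nzchar(path)) load_model_once("model_a", path)
}

# Stand-alone demonstration with a small stand-in object instead of a model.
tmp <- tempfile(fileext = ".rds")
saveRDS(list(coef = c(1, 2, 3)), tmp)
first <- load_model_once("demo", tmp)   # reads the file
unlink(tmp)                             # delete the file
second <- load_model_once("demo", tmp)  # still works: served from the cache
```

The second call succeeds even though the file is gone, which is the behaviour we expected from a preloaded package: one read per R process, not one per request.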

What are we missing here?

The following times are from a single `httr::GET(url)` against our package on our OpenCPU server:

    redirect  namelookup  connect   pretransfer  starttransfer  total
    1.626196  0.000045    0.000049  0.000118     1.633508       3.259843

For comparison, we get the following when we make a GET to one of the standard packages:

    redirect  namelookup  connect   pretransfer  starttransfer  total
    0.085428  0.000044    0.000049  0.000125     0.046630       0.132217
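To make the comparison easier to read, here is a small sketch that puts the two measurements side by side and computes the difference per field. The numbers are the ones reported above; the variable names (`own_pkg`, `std_pkg`, `extra`) are just illustrative. On a live server the vectors would come from the `times` element of the response, e.g. `httr::GET(url)$times`, with `url` pointing at the package in question.

```r
# Timing fields in the order reported above.
fields <- c("redirect", "namelookup", "connect",
            "pretransfer", "starttransfer", "total")

# Values copied from the two measurements above (seconds).
own_pkg <- setNames(c(1.626196, 0.000045, 0.000049, 0.000118, 1.633508, 3.259843), fields)
std_pkg <- setNames(c(0.085428, 0.000044, 0.000049, 0.000125, 0.046630, 0.132217), fields)

# One row per request, one column per timing field.
timings <- rbind(own_pkg, std_pkg)

# Extra time spent on our package compared with a standard package.
extra <- timings["own_pkg", ] - timings["std_pkg", ]

# Share of the total accounted for by redirect plus starttransfer.
share <- (own_pkg[["redirect"]] + own_pkg[["starttransfer"]]) / own_pkg[["total"]]

print(round(timings, 6))
print(round(extra, 6))
```

In these numbers the name lookup, connect and pretransfer phases are essentially identical for both requests; nearly all of the extra time shows up in `redirect` and `starttransfer`.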

I am new to this and not sure what else to try. I can't find anything in the documentation about what these times refer to or when data is cached in memory.

Have you tried adding the package name to the `preload` config parameter in `/etc/opencpu/server.conf`? (Listed in section 3.3 of the [opencpu server manual](https://cran.r-project.org/web/packages/opencpu/vignettes/opencpu-server.pdf).) — r2evans, Mar 22 '17 at 13:14
Thanks for your reply. Yes, it is already added to /etc/opencpu/server.conf (also described in [link](https://www.opencpu.org/posts/scoring-engine/)). — leboldt, Mar 22 '17 at 13:31
No, we haven't configured this. But I do not understand how that could influence performance. At the moment we are just testing the models, so the number of curl requests is very limited, as they are made manually. — leboldt, Mar 22 '17 at 14:20
I believe the default MPM for Apache is prefork, so unless you've changed it, it is likely to be what you need. The prefork model ensures new processes are already started and ready before a child process is needed for requests. Since R is used for every single API/function call, the turnover of child processes can be time-consuming, so you want a single child process per R instance, and you want them loaded before client requests are received. This is likely not your problem, but I don't think you get the same guarantees with other Apache multi-processing modules. — r2evans, Mar 22 '17 at 14:28
But I get your point: you don't have many client requests coming in, so perhaps that isn't the issue. — r2evans, Mar 22 '17 at 14:28
I've seen some similar posts gain some performance by putting their large data into databases, whether monolithic/local (sqlite) or something else (rdbms such as sql server, postgres, mariadb, mysql). This has consequences so is not necessarily a guaranteed improvement. — r2evans, Mar 22 '17 at 14:30
We haven't changed it, so it should just be the standard configuration. But it is good to keep in mind for the future. The data here is a model object (random forest), so unfortunately an RDBMS wouldn't work (or at least I think it would take a lot of work). But thanks for your input. — leboldt, Mar 23 '17 at 08:20
