I have a problem with the R server. I have a dataset with 25,000 rows and I want to impute the missing values. When I test this on my local computer (Windows 7, R version 3.4) there is no problem; the code takes a few minutes to run:

mice(data = result, m=2, method = "rf")

When I run the same code with the same dataset on a different server it takes hours. I tested it on the following servers:

  • SUSE Linux Enterprise Server 11 SP4 with R version 3.2.3
  • SUSE Linux Enterprise Server 11 SP4 with R version 3.4
  • Windows Server 2012 with R version 3.2.3

I need to run the code on one of these servers, because ultimately I want to call it from SAP HANA.

Is there a specific configuration I need for this?

Ulysse BN
Steffi K
  • Are you using the standard version of R on your Windows server or a multi-threaded one? Does the problem reproduce when you run the code on the slow systems directly in R, without Rserve? Can you profile the slow-running code? – Lars Br. Sep 04 '17 at 23:13
  • I used the standard version, because I use the same one on my local computer and it works well there. What do you mean by profiling the code? – Steffi K Sep 05 '17 at 06:20
  • Profiling the code means finding out how long each step, and the code as a whole, takes to run. There are several approaches for that in R, but I would start with: `ptm <- proc.time(); Rprof(tmp <- tempfile()); mice(data = result, m = 2, method = "rf"); Rprof(NULL); summaryRprof(tmp); proc.time() - ptm` Compare the output of this on the Linux system with the output on your local computer. – Lars Br. Sep 05 '17 at 08:28
  • Thanks. I profiled the code, first on my local computer and then on HANA. Local: user 536.22, system 7.30, elapsed 591.51. HANA: user 1801, system 6.18, elapsed 1807. How does this information help? – Steffi K Sep 05 '17 at 21:14
  • Did both runs use the exact same data set? Did you run the timing on an R model in HANA or directly in R on the Linux server that you want to connect to HANA? What is the processor speed/type for both your local computer and the Linux server? The timing information only shows that the majority of the time is spent processing "user" space code, which in this case corresponds to the R process. Can you also provide the Rprof output? That will break down the user time by individual R function. – Lars Br. Sep 06 '17 at 01:05
  • I tried both with the same dataset. I ran the example on an R model in HANA, on the Linux server from the R GUI, and on my local computer. When I run the code directly on the Linux server it is very fast, the same as locally. I can give you the Rprof() output, but you will see only one line, because the code example above is everything I run. – Steffi K Sep 06 '17 at 11:54
  • This is the Rprof() output from HANA: `(by.self = list(self.time = 1816.66, self.pct = 100, total.time = 1816.66, total.pct = 100), by.total = list(total.time = 1816.66, total.pct = 100, self.time = 1816.66, self.pct = 100), by.line = list(self.time = 1816.66, self.pct = 100, total.time = 1816.66, total.pct = 100), sample.interval = 0.02, sampling.time = 1816.66)` – Steffi K Sep 06 '17 at 11:54
  • OK, so now we know two important things: 1. running the code directly in R (RStudio) is fast, and 2. running the code via Rserve and then R is slow. The important bit here is that the R process invoked by Rserve takes more than triple the time of the R process started from RStudio. So the next step would be to check where the two R instances (started from RStudio and from Rserve) differ. Maybe some Linux user resource limits? Maybe a different R version? – Lars Br. Sep 08 '17 at 02:56
  • Anything beyond that would require - for me at least - that I can reproduce this behavior and for that, I'd need test data. So, if you can provide a test data set and a complete description how to produce the effect I could invest a bit more time and look into this on my setup. – Lars Br. Sep 08 '17 at 02:57
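
A minimal sketch of the comparison suggested in the comments above: run the same snippet once in the interactive R session and once inside the R process that Rserve starts, then compare the two printouts. It uses base R only. The helper name `collect_env` and the choice of fields are assumptions made for illustration; nothing in the thread confirms which of them, if any, explains the slowdown.

```r
# Collect facts about the running R process so that two sessions
# (e.g. interactive R vs. the R started by Rserve) can be compared.
# `collect_env` is a hypothetical helper, not part of mice or Rserve.
collect_env <- function() {
  list(
    r_version    = R.version.string,         # e.g. differing R versions
    platform     = R.version$platform,
    user         = Sys.info()[["user"]],     # OS user the process runs as
    lib_paths    = .libPaths(),              # may point to other package builds
    omp_threads  = Sys.getenv("OMP_NUM_THREADS", unset = NA),
    cores        = parallel::detectCores(),  # cores visible to this process
    mice_version = if (requireNamespace("mice", quietly = TRUE)) {
      as.character(utils::packageVersion("mice"))
    } else {
      NA_character_                          # mice not installed for this user
    }
  )
}

env <- collect_env()
str(env)  # print a compact summary to compare between sessions
```

If the two printouts differ in user, library paths, or package versions, that difference is a reasonable place to start looking; if they are identical, the cause likely lies elsewhere, for example in resource limits applied to the Rserve user.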

0 Answers