
I just started testing SparkR 2.0, and I find the execution of dapply to be very slow.

For example, the following code

set.seed(2)
random_DF <- data.frame(matrix(rnorm(1000000), 100000, 10))
system.time(dummy_res <- random_DF[random_DF[, 1] > 1, ])

user  system elapsed 
0.005   0.000   0.006 

executes in 6 ms.

Now, if I create a Spark DataFrame with 4 partitions and run on 4 cores, I get:

sparkR.session(master = "local[4]")

random_DF_Spark <- repartition(createDataFrame(random_DF),4)

subset_DF_Spark <- dapply(
    random_DF_Spark,
    function(x) {
        y <- x[x[1] > 1, ]
        y
    },
    schema(random_DF_Spark))

system.time(dummy_res_Spark <- collect(subset_DF_Spark))

user  system elapsed 
2.003   0.119  62.919 

That is, about 1 minute, which seems abnormally slow. Am I missing something?
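For comparison, the same subset can be expressed with SparkR's Column API instead of dapply, which (as I understand it) keeps the filtering in the JVM and avoids serializing every partition to an R worker. This is just a sketch of what I plan to try, not something I have timed; the X1 column name comes from the default data.frame naming:

# Filter with the Column API instead of dapply (sketch, not timed yet);
# this should avoid the R-worker serialization round-trip.
subset_DF_Spark_api <- filter(random_DF_Spark, random_DF_Spark$X1 > 1)
system.time(dummy_res_api <- collect(subset_DF_Spark_api))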

I also get a warning (TaskSetManager: Stage 64 contains a task of very large size (16411 KB). The maximum recommended task size is 100 KB.). Why is this 100 KB limit so low?
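My guess (not verified) is that the warning comes from createDataFrame embedding the local data in each task. If so, writing the data out and reading it back as a Spark source might avoid it. A sketch of what I would try:

# Persist to Parquet and read back so the rows are not shipped inside the task itself
# (sketch; the file path is arbitrary).
write.df(random_DF_Spark, path = "random_DF.parquet", source = "parquet", mode = "overwrite")
random_DF_Spark_2 <- read.df("random_DF.parquet", source = "parquet")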

I am using R 3.3.0 on Mac OS X 10.10.5.

Any insight welcome!
