
I have a large data set.

When I used mapply, running the code for a single instance (I have 400 thousand instances) took this long:

   user  system elapsed
   0.49    0.05    0.53

The function takes two arguments as input.

I got the idea from this link.

Are there any apply-function ideas for running this code more efficiently?

Edit: to give a better idea of the code:

A <- data.frame(V1 = sample(50000), V2 = sample(50000))

myfun <- function(x, y)
    return(length(which(x <= gh2$data_start & y >= gh2$data_end)))

output <- mapply(myfun, A$V1, A$V2)

gh2 is a data frame with 1 billion rows. The which() call alone takes about 0.30 s for a single search over this large data frame. The intention is to count how many rows satisfy the condition. Is there a more efficient way?

Bharath
  • If your code is inefficient, just "applying" is not going to help. If you want a speedup from parallelizing the problem, see `mcmapply` in the package `parallel`. This may be useful too: http://stackoverflow.com/questions/8827437/is-there-an-efficient-way-to-parallelize-mapply – fishtank Mar 08 '16 at 18:31
  • I tried that; it gave a small efficiency gain, but not much. – Bharath Mar 08 '16 at 19:11

1 Answer


You still haven't told us quite enough to reproduce your problem, but maybe my example below works. tl;dr: I can save about 10% by substituting sum() for length(which()) (I'm very surprised it wasn't more ...) and get a 5-fold speedup using Rcpp.

Generate example data:

set.seed(101)
n1 <- 1e4; n2 <- 1e3  
gh2 <- data.frame(data_start=rnorm(n1),data_end=rnorm(n1))

Try out both regular data frames and tbl_df from dplyr (also, data_frame is marginally more convenient for generating data since it allows on-the-fly transformation).

library("dplyr")
A <- data_frame(V1=rnorm(n2),
                V2=V1+runif(n2))
A0 <- as.data.frame(A)

Original function and base-R alternative using sum():

fun1 <- function(x,y)
    return(length(which(x<=gh2$data_start & y>=gh2$data_end)))
fun2 <- function(x,y)
    return(sum(x<=gh2$data_start & y>=gh2$data_end))

check:

all.equal(with(A0, mapply(fun1, V1, V2)),
          with(A, mapply(fun2, V1, V2)))  ## TRUE

Now an Rcpp version. This could almost certainly be shortened/made slicker, but I'm not very experienced with this framework (unlikely to make a huge speed difference, though).

library("Rcpp")
cppFunction("
NumericVector fun3(NumericVector d_start, NumericVector d_end,
                     NumericVector lwr, NumericVector upr) {
   int i, j;
   int n1 = lwr.size();
   int n2 = d_start.size();

   NumericVector res(n1);

   for (i=0; i<n1; i++) {
       res[i]=0;
       for (j=0; j<n2; j++) {
            if (lwr[i]<=d_start[j] && upr[i]>=d_end[j]) res[i]++;
       }
   }
   return res;
}
")

check:

f3 <- fun3(gh2$data_start,gh2$data_end, A$V1,A$V2)
f1 <- with(A0, mapply(fun1, V1, V2))
all.equal(f1,f3)  ## TRUE
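
As an aside on the "shortened/made slicker" point: the inner loop can be collapsed with Rcpp sugar, which mirrors the base-R fun2 more directly. This is only a sketch under the same assumptions as fun3 (the name fun3b is mine, and it is not included in the benchmark below):

cppFunction("
IntegerVector fun3b(NumericVector d_start, NumericVector d_end,
                    NumericVector lwr, NumericVector upr) {
    int n1 = lwr.size();
    IntegerVector res(n1);
    for (int i = 0; i < n1; i++) {
        double lo = lwr[i], hi = upr[i];
        // vectorized comparison + sum over gh2, as in the base-R fun2
        res[i] = sum((lo <= d_start) & (hi >= d_end));
    }
    return res;
}
")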

Benchmark:

library(rbenchmark)
benchmark(fun1.0= with(A0, mapply(fun1, V1, V2)),
          fun2.0= with(A0, mapply(fun2, V1, V2)),  ## data.frame
          fun2  = with(A, mapply(fun2, V1, V2)),   ## dplyr-style
          fun3 = fun3(gh2$data_start,gh2$data_end, A$V1,A$V2),
          columns=c("test", "replications", "elapsed", "relative"),
          replications=30
          )
##     test replications elapsed relative
## 1 fun1.0           30   7.813    5.699
## 3   fun2           30   6.834    4.985
## 2 fun2.0           30   6.841    4.990
## 4   fun3           30   1.371    1.000
  • not much difference between data.frame and tbl_df
  • sum() is 12% faster than length(which())
  • Rcpp is about 5x faster than base R

This could in principle be combined with parallel::mcmapply:

mcmapply(function(x, y) fun3(gh2$data_start, gh2$data_end, x, y),
         A$V1, A$V2, mc.cores=4)

but for the sizes in the example above the overhead is too high to make it worthwhile.
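
If the real problem is big enough that parallelization does pay off, one way to amortize the overhead is to split A into a handful of chunks and run the (already vectorized) fun3 on each chunk with parallel::mclapply. A rough sketch, assuming a fork-capable platform (i.e. not Windows) and a chunking scheme of my own choosing:

library(parallel)
n_cores <- 4
idx <- seq_len(nrow(A))
chunks <- split(idx, cut(idx, n_cores, labels = FALSE))  ## one block of rows per core
res <- unlist(mclapply(chunks,
                       function(i) fun3(gh2$data_start, gh2$data_end, A$V1[i], A$V2[i]),
                       mc.cores = n_cores))

With only a few long-running calls, the fork overhead is paid once per core instead of once per row.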

Ben Bolker