Parallelize user-defined function using apply family in R

Question

I have a script that takes too long to compute and I'm trying to parallelize its execution.

The script basically loops through each row of a data frame and performs some calculations, as shown below:

my.df = data.frame(id=1:9,value=11:19)

sumPrevious <- function(df,df.id){
    sum(df[df$id<=df.id,"value"])
}

for(i in 1:nrow(my.df)){
    print(sumPrevious(my.df,my.df[i,"id"]))
}

I'm just starting to learn how to parallelize code in R, which is why I first want to understand how I could do this with an apply-like function (e.g. sapply, lapply, mapply).

I've tried multiple things but nothing worked so far:

mapply(sumPrevious,my.df,my.df$id) # Error in df$id : $ operator is invalid for atomic vectors

`lapply`, `sapply` and `mapply` do not perform operations in parallel. They run a function serially. Do you just want to replace the `for` loop with `lapply`, or do you want to run your code in parallel? — tushaR, Aug 03 '17 at 04:31
It seems odd that you need parallelization for this? What you're trying to achieve is just a `cumsum(my.df$value)` for me, assuming that `my.df$id` is sorted. — F. Privé, Aug 03 '17 at 07:04

score 5 · Accepted Answer · answered Aug 03 '17 at 04:36

Using the parallel package in R, you can use the mclapply() function. You will need to adjust your code a little to make it run in parallel.

library(parallel)
my.df = data.frame(id=1:9,value=11:19)

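sumPrevious <- function(i, df) {
    # Sum the values of all rows whose id is <= the id of row i
    df.id <- df$id[i]
    sum(df[df$id <= df.id, "value"])
}

# Number of worker processes; detectCores() is one way to choose it
no.of.cores <- detectCores()

mclapply(X = 1:nrow(my.df), FUN = sumPrevious, my.df, mc.preschedule = TRUE, mc.cores = no.of.cores)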

This runs sumPrevious in parallel on no.of.cores cores of your machine. Note that mclapply() relies on forking, which is not available on Windows, where mc.cores must be 1.
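Since mclapply() depends on forking, here is a minimal sketch of a socket-cluster alternative from the same parallel package, which should also work on Windows. It swaps mclapply() for parLapply(); the worker count of 2 and the names cl and res are arbitrary choices for illustration, not part of the answer above.

```r
library(parallel)

my.df <- data.frame(id = 1:9, value = 11:19)

sumPrevious <- function(i, df) {
    # Sum the values of all rows whose id is <= the id of row i
    df.id <- df$id[i]
    sum(df[df$id <= df.id, "value"])
}

# Start two local worker processes (2 is an arbitrary choice for illustration)
cl <- makeCluster(2)

# Arguments after the function are forwarded to it, so each call gets df = my.df
res <- parLapply(cl, seq_len(nrow(my.df)), sumPrevious, df = my.df)

# Always release the workers when done
stopCluster(cl)

# parLapply() returns a list; flatten it to a vector
unlist(res)
```

For a data frame this small, the cost of starting workers and shipping the data to them will likely outweigh any gain, so this is mainly useful as a template for heavier per-row work.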

score 2 · Answer 2 · answered Aug 03 '17 at 04:43

Well, this is fun to play with. You need something like the following:

 mapply(sumPrevious,list(my.df),my.df$id)

For sapply, since the first argument of sumPrevious is the data frame, you have to define a wrapper function that passes the arguments in the right order:

  sapply(my.df$id,function(x,y) sumPrevious(y,x),my.df)

I prefer mapply here since we can pass the data frame directly as the first argument. It has to be the whole data frame, though, rather than its columns one at a time, which is why you have to wrap it in list().
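To see why the list() wrapper matters, here is a small sketch. A data frame is itself a list of columns, so mapply() iterates over the columns unless the data frame is wrapped in a list of length one. The names col.classes and whole.classes are made up for this illustration.

```r
my.df <- data.frame(id = 1:9, value = 11:19)

# A data frame is a list of columns, so mapply() walks over its columns:
# the function is called twice, once with my.df$id and once with my.df$value
col.classes <- mapply(function(col) class(col), my.df)

# Wrapped in list(), the data frame is a single element that is recycled,
# so the function receives the whole data frame once per id
whole.classes <- mapply(function(d, id) class(d), list(my.df), my.df$id)

print(col.classes)
print(whole.classes)
```

This is also what produces the "$ operator is invalid for atomic vectors" error in the question: sumPrevious receives a single column (an atomic vector) as df and then tries df$id on it.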

Map is a wrapper around mapply that does not simplify, so it returns the solution as a list. Try it. Likewise, lapply is similar to sapply, except that sapply simplifies the results into a vector or array where possible, while lapply returns the same results as a list.
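A quick sketch comparing the return types, using the sumPrevious definition from the question (data frame first, id second). The variable names res.sapply, res.lapply and res.Map are just for illustration.

```r
my.df <- data.frame(id = 1:9, value = 11:19)

sumPrevious <- function(df, df.id) {
    sum(df[df$id <= df.id, "value"])
}

# sapply() simplifies the results to a plain vector
res.sapply <- sapply(my.df$id, function(x) sumPrevious(my.df, x))

# lapply() returns the same values as a list
res.lapply <- lapply(my.df$id, function(x) sumPrevious(my.df, x))

# Map() behaves like mapply() without simplification, so it returns a list too
res.Map <- Map(sumPrevious, list(my.df), my.df$id)

str(res.sapply)
str(res.lapply[1:2])
str(res.Map[1:2])
```

If a vector is needed from the list versions, unlist() will flatten them.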

That said, it seems that what you are trying to do can be done simply with the cumsum function:

 cumsum(my.df$value)
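As a sanity check, the sketch below compares the cumulative sum with the row-by-row approach on the question's data (my.df with columns id and value). It also shows one possible way to handle unsorted but unique ids by ordering first; the shuffled data frame and the names loop.result, fast.result, ord and fast.shuffled are assumptions made for this example.

```r
my.df <- data.frame(id = 1:9, value = 11:19)

sumPrevious <- function(df, df.id) {
    sum(df[df$id <= df.id, "value"])
}

# Row-by-row version versus the vectorized cumulative sum (ids already sorted)
loop.result <- sapply(my.df$id, function(x) sumPrevious(my.df, x))
fast.result <- cumsum(my.df$value)
print(all.equal(loop.result, fast.result))

# If the ids are unique but not sorted, take the cumulative sum in id order
# and write each result back to the row it belongs to
shuffled <- my.df[c(3, 1, 2, 9, 8, 7, 4, 5, 6), ]
ord <- order(shuffled$id)
fast.shuffled <- integer(nrow(shuffled))
fast.shuffled[ord] <- cumsum(shuffled$value[ord])

loop.shuffled <- sapply(shuffled$id, function(x) sumPrevious(shuffled, x))
print(all.equal(loop.shuffled, fast.shuffled))
```

The vectorized version avoids rescanning the data frame for every row, which is usually a bigger win than parallelizing the loop.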
