This question is specifically about the use of multiple cores to run a given function where the function requires a package and additional arguments to run.

I have a large dataset of the following form:

Event_ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Type=c("A","B","C","D","E","A","B","C","D","E","A","B","C","D")
Revenue1=c(24,9,51,7,22,15,86,66,0,57,44,93,34,37)
Revenue2=c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z = data.frame(Event_ID,Type,Revenue1,Revenue2)

I have a fairly complex function that I am trying to run on multiple cores. Below is a simplified version: for each row, it takes the sum of two columns and subtracts the product of two matrices (apologies if the function is over-simplified, but I am trying to understand how parallel processing works). Here is the function:

set.seed(100)
library(truncnorm)
alpha_old = matrix(c(1,5), nrow=1)
Total_Revenue = function(data, alpha_old){
  # work on the data frame passed in as `data` rather than the global z
  for (i in seq_len(nrow(data))){
    beta_old = matrix(rtruncnorm(2,a=1,b=10,mean=5,sd=1), ncol=1) # generates beta for each row
    adjustment_factor = as.numeric(alpha_old %*% beta_old) # computes the (scalar) adjustment factor for each row
    data[i,'Total_Rev'] = data[i,'Revenue1'] + data[i,'Revenue2'] - adjustment_factor
  }
  return(data)
}
Total = Total_Revenue(data=z, alpha_old=alpha_old)
print(Total)

Running the function regularly and printing the results out provides the expected output (output shown at the end).

Now I want to run the same computation on multiple cores using parSapply. I tried the following:

library(parallel)
library(doParallel)
no_cores <- detectCores() - 1
registerDoParallel(cores=no_cores)
cl2 <- makeCluster(no_cores)
invisible(clusterEvalQ(cl2, library(truncnorm)))
clusterExport(cl=cl2, varlist=c("alpha_old","z"), envir=environment())
result1 = parSapply(cl2, X= 1:nrow(z),FUN=Total_Revenue,data=z,alpha_old=alpha_old)
stopCluster(cl2)

I get the following message:

Error in checkForRemoteErrors(val) : 14 nodes produced errors; first error: unused argument (X[[i]])
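
One possible reading of this error, offered as a sketch rather than a definitive fix: parSapply calls FUN once per element of X and passes that element as the first unnamed argument. Total_Revenue only has the formals data and alpha_old, and both are already supplied by name, so the extra element has nowhere to go, which would explain "unused argument (X[[i]])". The version below assumes a hypothetical per-row helper, row_total (not part of the original code), whose first argument is the row index. The worker count of 2 and the use of clusterSetRNGStream for seeding are also assumptions. Because the workers draw their own random numbers, the values will not match the single-core run.

```r
library(parallel)
library(truncnorm)

# Example data from the question
Event_ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Type <- c("A","B","C","D","E","A","B","C","D","E","A","B","C","D")
Revenue1 <- c(24,9,51,7,22,15,86,66,0,57,44,93,34,37)
Revenue2 <- c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z <- data.frame(Event_ID, Type, Revenue1, Revenue2)
alpha_old <- matrix(c(1, 5), nrow = 1)

# Hypothetical helper: computes Total_Rev for a single row.
# The first argument i receives the element of X that parSapply passes in.
row_total <- function(i, data, alpha_old) {
  beta_old <- matrix(truncnorm::rtruncnorm(2, a = 1, b = 10, mean = 5, sd = 1), ncol = 1)
  adjustment_factor <- as.numeric(alpha_old %*% beta_old)
  data[i, "Revenue1"] + data[i, "Revenue2"] - adjustment_factor
}

cl <- makeCluster(2)                    # assumed worker count
clusterSetRNGStream(cl, iseed = 100)    # reproducible random streams on the workers
invisible(clusterEvalQ(cl, library(truncnorm)))

# data and alpha_old travel to the workers through the ... arguments
z$Total_Rev <- parSapply(cl, seq_len(nrow(z)), row_total,
                         data = z, alpha_old = alpha_old)
stopCluster(cl)
print(z)
```

With millions of rows, one call per row may carry noticeable overhead, so passing chunks of row indices to each call instead of single indices might be worth trying.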

This is the first time I am trying to use multicore processing, and I am not very familiar with the parallel and doParallel packages. The actual dataset I am working with has around 5 million observations, and the real function involves additional steps (comparing each row's values against other values in the dataset), which I removed from the example function. Any help with this would be greatly appreciated. Thanks in advance.

P.S. The output that I get by running the function on one core:

[Image: printed output of Total from the single-core run]

P.P.S. The example data is taken from another question that I had posted here: Gpu processing R (How to use Gpu processing to run a function on subsets of a dataset)

Prometheus