I have 1000 CSV files in my working directory, and each file contains a location ID, rainfall, and temperature. The structure of one file is shown below:
set.seed(123)
my.dat <- data.frame(Id = rep(1, each = 365),
                     rain = runif(365, min = 0, max = 20),
                     tmean = sample(20:40, 365, replace = TRUE))
I wrote an Rcpp function that is also stored in my working directory. This function takes in rainfall and temperature data and calculates two derived variables, var1 and var2. I want to read each location's weather data, apply the function, and save the corresponding output using the foreach package.
library(doParallel)
library(data.table)

location.vec <- 1:1000
myClusters <- makeCluster(6)
registerDoParallel(myClusters)
foreach(i = 1:length(location.vec),
        .packages = c('Rcpp', 'dplyr', 'data.table'),
        .noexport = c('myRcppFunc'),
        .verbose = T) %dopar% {

  # compile the Rcpp function on the worker (it can't be exported from the master)
  Rcpp::sourceCpp('myRcppFunc.cpp')

  idRef <- location.vec[i]

  # read the weather data
  temp_weather <- fread(paste0('weather_', idRef, '.csv'))

  # apply my Rcpp function
  temp_weather[, c("var1", "var2") := myRcppFunc(rain, tmean)]

  # save my output
  fwrite(temp_weather, paste0('weather_', idRef, '_modified.csv'))
}
stopCluster(myClusters)
This loop behaves strangely: sometimes it gets stuck on iteration 10, sometimes on 40, etc., every time I run it, and then I have to kill the job.
My suspicion is that this is driven by multiple processes trying to compile/access the Rcpp function at the same time, which leads to the hang. How can I fix it? Can I load the Rcpp function once per worker so that I don't have to keep recompiling it on every iteration? Any other advice?
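For reference, this is the variant I was considering trying: compiling the Rcpp source once on each worker up front (via parallel::clusterCall) rather than on every iteration, so each worker only ever calls sourceCpp once. I haven't verified that this avoids the hang, so treat it as a sketch:

```r
library(doParallel)
library(data.table)

myClusters <- makeCluster(6)
registerDoParallel(myClusters)

# compile myRcppFunc.cpp once per worker, not once per iteration;
# each worker compiles into its own session temp directory, so the
# workers should not step on each other's build artifacts
parallel::clusterCall(myClusters, function() {
  Rcpp::sourceCpp('myRcppFunc.cpp')
  NULL
})

foreach(i = 1:1000,
        .packages = 'data.table',
        .noexport = 'myRcppFunc') %dopar% {
  temp_weather <- fread(paste0('weather_', i, '.csv'))
  temp_weather[, c("var1", "var2") := myRcppFunc(rain, tmean)]
  fwrite(temp_weather, paste0('weather_', i, '_modified.csv'))
  NULL  # avoid shipping the data.table back to the master
}

stopCluster(myClusters)
```

Is this a sound way to do it, or is there a better pattern for using compiled Rcpp functions inside %dopar%?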
Thanks