
Currently I am using a foreach loop in R to run function calls in parallel across multiple cores of the same machine, and the code looks something like this:

library(doParallel)
cl <- makeCluster(12)
registerDoParallel(cl)

result <- foreach(i = 1:length(list_of_dataframes)) %dopar% {
  some_function(list_of_dataframes[[i]])
}

stopCluster(cl)

In list_of_dataframes, each data frame corresponds to one product and has a different number of columns. some_function performs a modelling task on each data frame; it calls several other functions internally, some for data wrangling, others for variable selection, and so on. The result is a list of lists, with each sub-list being a list of 3 data frames. For now I have only about 500 products, and I run this task in parallel on a 32 GB machine with 12 cores using doParallel and foreach. My first question is: how do I scale this up, say to 500,000 products, and which framework would be ideal for that? My second question is: can I use SparkR for this? Is Spark meant to perform tasks like these? Would spark.lapply() be a good thing to use? I have read that it should only be used as a last resort.
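For context, this is roughly what I imagine a SparkR version would look like; I have not tried it, and the session settings below are only placeholders:

library(SparkR)

# Start a Spark session (local here; a cluster master URL for real scaling)
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.driver.memory = "8g"))

# Apply some_function to each product's data frame across the Spark workers
result <- spark.lapply(list_of_dataframes, some_function)

sparkR.session.stop()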

I am very new to all this parallel processing, so any help or suggestions would be greatly appreciated. Thanks in advance.

  • Just a quick comment: As it is set up now, you need to pass the whole list of data.frames to each worker. That is a waste of memory and time. You should change `some_function` to accept a data.frame as input and do `foreach(d = list_of_dataframes)` (see the sketch after these comments). – Roland Sep 11 '20 at 07:05
  • Hey, I made a small mistake in the code and have updated it now. Can you please clarify whether it still wastes time and memory, and how to restructure it? – Paras Karandikar Sep 11 '20 at 11:17
  • My previous comment still applies. – Roland Sep 12 '20 at 10:25
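
A minimal sketch of the restructuring Roland describes, assuming the same some_function and doParallel backend as above (untested):

# Iterate over the list directly so each worker receives only its own
# data frame, instead of the whole list_of_dataframes being exported
result <- foreach(d = list_of_dataframes) %dopar% {
  some_function(d)
}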

0 Answers