Currently I am using a foreach loop in R to run function calls in parallel across multiple cores of the same machine, and the code looks something like this:
library(foreach)
library(doParallel)
registerDoParallel(cores = 12)   # one worker per core on this machine

result <- foreach(i = seq_along(list_of_dataframes)) %dopar% {
  some_function(list_of_dataframes[[i]])   # last expression is the return value for each i
}
In list_of_dataframes, each data frame is for one product and has a different number of columns. some_function performs a modelling task on each data frame; there are multiple function calls inside it, some doing data wrangling, others performing some sort of variable selection, and so on. The result is a list of lists, with each sub-list being a list of 3 data frames.

For now I have barely 500 products, and I am running this task in parallel with doParallel and foreach on a 32 GB machine with 12 cores.

My first question is: how do I scale this up, say when I have 500,000 products, and which framework would be ideal for that?

My second question is: can I use SparkR for this? Is Spark meant to perform tasks like these? Would spark.lapply() be a good thing to use? I have read that it should only be used as a last resort.
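For concreteness, this is roughly what I imagine the spark.lapply() version would look like. It is only a sketch: I am assuming SparkR 2.x is installed, a Spark cluster is reachable, and appName is just a placeholder; some_function and list_of_dataframes are the same objects as in the code above.

library(SparkR)
sparkR.session(appName = "product_models")   # placeholder session setup

# spark.lapply() ships each list element to a worker and applies the function there,
# so some_function's package dependencies would need to be installed on every worker node
result <- spark.lapply(list_of_dataframes, some_function)

sparkR.session.stop()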
I am very new to all this parallel stuff; any help or suggestions would be greatly appreciated. Thanks in advance.