
Currently, I am using the foreach loop from the doParallel package to run function calls in parallel across multiple cores of the same machine, which looks something like this:

library(doParallel)
registerDoParallel(cores = parallel::detectCores())

out_results <- foreach(i = seq_along(some_list)) %dopar% {
  out <- function_call(some_list[[i]])
  return(out)
}

This some_list is a list of data frames, and each data frame can have a different number of columns. The function_call() does several things with the data: data manipulation, then random forest for variable selection, and finally a least-squares fit. The variable out is again a list of 3 data frames, and out_results will be a list of lists. Inside the function call I use CRAN packages and some custom packages I created myself. I want to avoid the Spark ML libraries because of their limited functionality, and because they would require rewriting the entire code.
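
For illustration only, here is a minimal sketch of what such a function_call() might look like, based purely on the description above (the response column name "y", the na.omit step, and the top-5 variable cutoff are all assumptions, not my actual code):

library(randomForest)

# Hypothetical sketch of function_call(): data manipulation, random forest
# for variable selection, then a least-squares fit on the selected variables.
function_call <- function(df) {
  # 1. Data manipulation (placeholder: drop rows with missing values)
  cleaned <- na.omit(df)

  # 2. Random forest for variable selection (assumes a response column "y")
  rf <- randomForest(y ~ ., data = cleaned, importance = TRUE)
  top_vars <- names(sort(importance(rf)[, 1], decreasing = TRUE))[1:5]

  # 3. Least-squares fit on the selected variables
  fit <- lm(reformulate(top_vars, response = "y"), data = cleaned)

  # Return a list of three data frames, as described above
  list(
    selected     = data.frame(variable = top_vars),
    coefficients = data.frame(term = names(coef(fit)), estimate = coef(fit)),
    fitted       = data.frame(fitted = fitted(fit))
  )
}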

I want to leverage Spark for running these function calls in parallel. Is it possible to do so? If yes, in which direction should I be thinking? I have read a lot of the sparklyr documentation, but it doesn't seem to help much since the examples provided there are very straightforward.

    One second of googling will give you the way to go. What have you tried already? Why doesn't it work? – Pierre Gramme Sep 03 '20 at 11:23
  • @PierreGramme can you suggest something? I know I had put up the previous question in a very vague manner, and we can use the spark_apply function to do the regression in parallel, but my real question is the one I have edited in. Apologies for the misunderstanding. – Paras Karandikar Sep 08 '20 at 05:50

1 Answer


sparklyr's homepage gives examples of running arbitrary R code distributed on the Spark cluster. In particular, see their example with grouped operations.

Your main structure should be a data frame, which you will process rowwise. Probably something like the following (not tested):

library(sparklyr)
library(dplyr)

some_list <- list(tibble(a = 1), tibble(b = 1), tibble(c = 1:2))
all_data  <- tibble(i = seq_along(some_list), df = some_list)

# Replace this with your actual code. 
# Should get one dataframe and produce one dataframe. 
# Embedded dataframe columns are OK
transform_one <- function(df_wrapped) {
  # in your example, you expect only one record per group
  stopifnot(nrow(df_wrapped) == 1)
  # the df column is a list-column, so extract the single embedded data frame
  df <- df_wrapped$df[[1]]

  res0 <- df
  res1 <- tibble(x = 10)
  res2 <- tibble(y = 10:11)

  return(tibble(res0 = list(res0), res1 = list(res1), res2 = list(res2)))
}

all_data %>% spark_apply(
  transform_one,
  group_by = c("i"), 
  columns = c("res0"="list", "res1"="list", "res2"="list"),
  packages = c("randomForest", "etc")
)
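
One practical detail: spark_apply() operates on a Spark data frame, so the local all_data tibble would first have to be copied to the cluster, for example with copy_to(). A rough sketch, also not tested (the connection settings are placeholders, and copy_to() may not handle the nested data-frame column cleanly):

library(sparklyr)

# Placeholder connection; adjust master/config for your cluster
sc <- spark_connect(master = "local")

# Copy the local tibble to Spark before calling spark_apply().
# Note: the nested data-frame (list) column may need special handling here.
all_data_spark <- copy_to(sc, all_data, "all_data", overwrite = TRUE)

# The spark_apply() pipeline above would then start from all_data_spark
# instead of all_data.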

All in all, this approach seems unnatural, as if we were forcing the use of Spark on a task which does not really fit. Maybe you should check for another parallelization framework?

Pierre Gramme
  • I am not sure if looking at another framework is possible, since the machine has very limited permissions. I tried spark_apply a couple of days back; the problem is it doesn't allow me to nest functions. I don't know if I am doing something wrong, but if there is another function call inside transform_one it will throw an error. On the other hand, if I use spark.lapply from SparkR (see the sketch below) it works fine with nested calls and doesn't need a spark_tbl input. One more thing I am confused about: if I have 10 workers and this function needs to be called 100 times, wouldn't it split 10 calls to each worker? – Paras Karandikar Sep 10 '20 at 14:32
  • Also, I am very new to spark, apologies in advance for any dumb questions. – Paras Karandikar Sep 10 '20 at 14:33
  • The problems you are describing now are quite different from your initial question. Please ask a new question – Pierre Gramme Sep 10 '20 at 18:27
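
For reference, a minimal sketch of the SparkR spark.lapply approach mentioned in the comment above (not tested; it assumes a SparkR session can be started, and that some_list and function_call are defined as in the question):

library(SparkR)

# Start a SparkR session (settings are placeholders)
sparkR.session(master = "local")

# spark.lapply distributes the elements of a local list across the workers
# and applies the function to each element, returning a local list of results.
out_results <- spark.lapply(some_list, function(df) {
  function_call(df)
})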