
Currently, I am using the foreach loop from the doParallel package to run function calls in parallel across multiple cores of the same machine, which looks something like this:

library(doParallel)
registerDoParallel(cores = parallel::detectCores())

out_results <- foreach(i = seq_along(some_list)) %dopar% {
  out <- function_call(some_list[[i]])
  return(out)
}

This some_list is a list of data frames, and each data frame can have a different number of columns. The function_call() does several things with the data: data manipulation, then random forest for variable selection, and finally a least-squares fit. The variable out is again a list of 3 data frames, and out_results will be a list of lists. Inside the function call I use CRAN packages and some custom packages I created myself. I want to avoid the Spark ML libraries because of their limited functionality, and because they would require rewriting the entire code.
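
For illustration only, here is a minimal sketch of what such a function_call() might look like, based purely on the description above (the response column name "y", the na.omit step, and the top-5 variable cutoff are all assumptions, not my actual code):

library(randomForest)

# Hypothetical sketch of function_call(): data manipulation, random forest
# for variable selection, then a least-squares fit on the selected variables.
function_call <- function(df) {
  # 1. Data manipulation (placeholder: drop rows with missing values)
  cleaned <- na.omit(df)

  # 2. Random forest for variable selection (assumes a response column "y")
  rf <- randomForest(y ~ ., data = cleaned, importance = TRUE)
  top_vars <- names(sort(importance(rf)[, 1], decreasing = TRUE))[1:5]

  # 3. Least-squares fit on the selected variables
  fit <- lm(reformulate(top_vars, response = "y"), data = cleaned)

  # Return a list of three data frames, as described above
  list(
    selected     = data.frame(variable = top_vars),
    coefficients = data.frame(term = names(coef(fit)), estimate = coef(fit)),
    fitted       = data.frame(fitted = fitted(fit))
  )
}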

I want to leverage Spark for running these function calls in parallel. Is it possible to do so? If yes, in which direction should I be thinking? I have read a lot of the sparklyr documentation, but it doesn't seem to help much since the examples provided there are very straightforward.

    One second of googling will give you the way to go. What have you tried already? Why doesn't it work? – Pierre Gramme Sep 03 '20 at 11:23
  • @PierreGramme can you suggest something? I know I had put up the previous question in a very vague manner, and we can use the spark_apply function to do the regression in parallel, but my real question is the one I have edited in. Apologies for the misunderstanding. – Paras Karandikar Sep 08 '20 at 05:50

1 Answer


sparklyr's homepage gives examples of running arbitrary R code distributed on the Spark cluster. In particular, see their example with grouped operations.

Your main structure should be a data frame, which you will process rowwise. Probably something like the following (not tested):

library(sparklyr)
library(dplyr)

some_list <- list(tibble(a = 1), tibble(b = 1), tibble(c = 1:2))
all_data  <- tibble(i = seq_along(some_list), df = some_list)

# Replace this with your actual code. 
# Should get one dataframe and produce one dataframe. 
# Embedded dataframe columns are OK
transform_one <- function(df_wrapped) {
  # in your example, you expect only one record per group
  stopifnot(nrow(df_wrapped) == 1)
  # the df column is a list-column, so extract the single embedded data frame
  df <- df_wrapped$df[[1]]

  res0 <- df
  res1 <- tibble(x = 10)
  res2 <- tibble(y = 10:11)

  return(tibble(res0 = list(res0), res1 = list(res1), res2 = list(res2)))
}

all_data %>% spark_apply(
  transform_one,
  group_by = c("i"), 
  columns = c("res0"="list", "res1"="list", "res2"="list"),
  packages = c("randomForest", "etc")
)
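
One practical detail: spark_apply() operates on a Spark data frame, so the local all_data tibble would first have to be copied to the cluster, for example with copy_to(). A rough sketch, also not tested (the connection settings are placeholders, and copy_to() may not handle the nested data-frame column cleanly):

library(sparklyr)

# Placeholder connection; adjust master/config for your cluster
sc <- spark_connect(master = "local")

# Copy the local tibble to Spark before calling spark_apply().
# Note: the nested data-frame (list) column may need special handling here.
all_data_spark <- copy_to(sc, all_data, "all_data", overwrite = TRUE)

# The spark_apply() pipeline above would then start from all_data_spark
# instead of all_data.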

All in all, this approach seems unnatural, as if we were forcing the use of Spark on a task which does not really fit. Maybe you should check for another parallelization framework?

Pierre Gramme
  • I am not sure if looking at another framework is possible, since the machine has very limited permissions. I tried spark_apply a couple of days back; the problem is it doesn't allow me to nest functions. I don't know if I am doing something wrong, but if there is another function call inside transform_one it will throw an error. On the other hand, if I use spark.lapply from SparkR (see the sketch below) it works fine with nested calls and doesn't need a spark_tbl input. One more thing I am confused about: if I have 10 workers and this function needs to be called 100 times, wouldn't it split 10 calls to each worker? – Paras Karandikar Sep 10 '20 at 14:32
  • Also, I am very new to spark, apologies in advance for any dumb questions. – Paras Karandikar Sep 10 '20 at 14:33
  • The problems you are describing now are quite different from your initial question. Please ask a new question – Pierre Gramme Sep 10 '20 at 18:27
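
For reference, a minimal sketch of the SparkR spark.lapply approach mentioned in the comment above (not tested; it assumes a SparkR session can be started, and that some_list and function_call are defined as in the question):

library(SparkR)

# Start a SparkR session (settings are placeholders)
sparkR.session(master = "local")

# spark.lapply distributes the elements of a local list across the workers
# and applies the function to each element, returning a local list of results.
out_results <- spark.lapply(some_list, function(df) {
  function_call(df)
})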