
I have some functions that are executed sequentially, that is, one runs and then the next, and so on. Due to the computational time involved, I am interested in parallelizing these functions in R. Searching here on the forum, I found some information in the following topic: Run several R functions in parallel.

However, I confess that I am a beginner in this area of data science and have never done parallel programming. My situation is that I am joining several datasets via the left_join() function, and each of those joins takes the result of one of six different functions. After that, I create some variables and store everything in a data.frame; see the following computational routine:

library(dplyr)     # left_join(), select(), mutate(), contains()
library(lubridate) # year()

Datajoin <- function(Data1, 
                     Data2, 
                     Data3) {
  fnctions <- Fun1(Data1, Data3) %>% 
    left_join(Fun2(Data1)) %>% 
    left_join(Fun3(Data1, Data2, Data3)) %>% 
    left_join(Fun4(Data1)) %>% 
    left_join(Fun5(Data1, Data3), by = c('St_ab_Data1' = 'St_ab')) %>% 
    left_join(Fun6(Data1, Data3), by = c('St_ab_Data1' = 'St_ab')) %>% 
    cbind(Data2 %>% 
            select(St_ab, Var2) %>% 
            mutate(Var2 = factor(Var2, levels = unique(Data3$Var2))) %>% 
            {model.matrix(~ -1 + .$Var2) %>% 
                as.data.frame}) %>% 
    left_join(Data2 %>% 
                as.data.frame %>% 
                select(St_ab, Var3,
                       contains("VarA"), 
                       contains("VarB"), 
                       contains("VarC"),
                       contains("VarD")) %>% 
                mutate(Var4 = year(Var3)), 
              by = c('St_ab_Data1' = 'St_ab'))
  names(fnctions)[names(fnctions) == "St_ab_Data1"] <- 'St_id'
  return(fnctions)
}

As I already mentioned, my big problem is the computational time: the datasets are gigantic, so this process is computationally inefficient. How could I parallelize the work of these functions, given the computational structure above?

I tried to do it as follows, based on the forum post linked at the beginning of this question:

# Datajoin() defined exactly as above

tasks1 = list(wrk1 = function(x) Datajoin(x))
library(parallel)
clus = makeCluster(6)
clusterExport(clus, c('Datajoin', 
                      'Data1', 'Data2', 'Data3'))

outPUT = clusterApply(
  clus,
  tasks1
)
stopCluster(clus)

However, while no error appears, it also does not return the desired output.

Note: Fun1-Fun6 are my own functions, but I cannot share them in this topic. I tried to adapt the post to make it reproducible by substituting base R functions, but I ran into a series of errors; I apologize in advance.

user55546
  • R is case sensitive, instead of `clusterExport(clUS, etc)` it should be `clusterExport(clus, etc)`. – Rui Barradas Nov 09 '22 at 15:04
  • I don't think you're going to be able to parallelise this by simply wrapping it in a `clusterApply()`. You have written this as a long pipe, i.e. one big function call stack, so the functions need to be evaluated in sequence before the enclosing function can be executed. You could potentially hive off some of the internal function calls, e.g. `Fun2(Data1)`, to different processes and then collect the results in this function. But before trying this in parallel, I'd rewrite it in `data.table` - all these joins should be a lot faster if you update by reference rather than copy huge datasets; see the sketch after these comments. – SamR Nov 09 '22 at 15:08
  • @RuiBarradas sorry, that was just a mistake when posting the code here on the forum; I have already corrected it. – user55546 Nov 09 '22 at 15:35
  • I second the recommendation to switch to data.table. Among other advantages, data.table is parallelized internally. – Roland Nov 16 '22 at 08:15
  • General advice: it seems like you do a lot of filtering within your pipe, and perhaps you have already imported this data from an external SQL server (or similar). A general rule of thumb for optimization: filter your data as early as possible (work with the least amount of data necessary), and keep your transformations within the database if that is viable from an effort perspective. You could potentially also save time by ordering your joins intelligently if you have prior information about which results you expect to be bigger (join the smaller ones first, then the bigger ones). – Oliver Nov 17 '22 at 09:01
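
To illustrate the `data.table` suggestion in these comments, here is a minimal sketch of a keyed join that updates by reference instead of copying. It is only an illustration under assumptions: `Data1` and `Data2` stand in for the real tables, and it assumes both share the key column `St_ab` and that `Var2` lives in `Data2`.

library(data.table)

DT1 <- as.data.table(Data1)  # one-off conversion (setDT() would convert in place)
DT2 <- as.data.table(Data2)
setkey(DT1, St_ab)           # keyed joins use a fast binary search
setkey(DT2, St_ab)

# pull Var2 from DT2 into DT1 by reference -- no copy of DT1 is made
DT1[DT2, Var2 := i.Var2]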

1 Answer


Given the example, it should be possible to precompute the results of Fun1-Fun6 (in parallel) before joining their results. The following is an example using the furrr package, which is a drop-in replacement for purrr but uses the future package for parallelization; see the furrr documentation for more details.

reproducible example

This sets up a trivial example that broadly follows the logic of the question: it executes three simple functions on the iris dataset, each of which takes ~2 seconds, and then joins the results.

library(dplyr)
library(tictoc) # for timing

df1 <- df2 <- df3 <- iris %>% mutate(id = row_number())

fun1 <- function(df1, df2) {
  Sys.sleep(2)
  df1 %>% select(Sepal.Length, id) %>%
    left_join(df2 %>% select(Sepal.Width, id), by = "id")
}

fun2 <- function(df) {
  Sys.sleep(2)
  df %>% select(Species, id)
}

fun3 <- function(df1, df2) {
  Sys.sleep(2)
  df1 %>% select(Petal.Length, id) %>%
    left_join(df2 %>% select(Petal.Width, id), by = "id")
}

data_join_function <- function(df1, df2, df3) {
  
  fun1(df1, df2) %>%
    left_join(fun2(df1), by = "id") %>%
    left_join(fun3(df1, df3), by = "id")
  
}

tic()
df_joint <- data_join_function(df1, df2, df3)
toc()

# 6.17 sec elapsed

rewrite for furrr

Here we rewrite the procedure to enable parallelization. We create a list of functions and a corresponding list of argument sets, then use furrr::future_map2 to iterate over both lists, just as with purrr::map2. This returns a list of the function results, which we finally join with the original logic.

library(furrr) 

data_join_function_furrr <- function(df1, df2, df3) {
  
  # collect arguments for each function
  args_list <- list(
    list(df1, df2),
    list(df1),
    list(df1, df3)
  )
  
  # list of functions
  funcs <- list(fun1, fun2, fun3)
  
  # iterate over functions and arguments a la purrr::map2
  func_results <- furrr::future_map2(
    funcs, args_list, function(func, args){
      do.call(func, args)
    }
  )
  
  # join the results
  func_results[[1]] %>%
    left_join(func_results[[2]], by = "id") %>%
    left_join(func_results[[3]], by = "id")
  
}

tic()
df_joint_furrr <- data_join_function_furrr(df1, df2, df3)
toc()

# 6.21 sec elapsed

parallelize

So far, this has not improved performance. The final piece is to set up multiple workers, which furrr can make use of. This reduces the runtime by the expected factor of 3 (except for the first run, likely because of some overhead in setting up the workers).

future::plan(multisession, workers = 3)


for (i in 1:5) {
  tic()
  df_joint_furrr <- data_join_function_furrr(df1, df2, df3)
  toc()
}
# 2.81 sec elapsed
# 2.09 sec elapsed
# 2.11 sec elapsed
# 2.08 sec elapsed
# 2.06 sec elapsed
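
A practical note, assuming the default multisession behaviour: the workers created by future::plan() persist for the rest of the session and are reused by subsequent calls, which is why only the first run above pays the startup cost. Calling future::plan(sequential) switches back to sequential execution.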
  
pholzm
  • `furrr` looks like a good solution to your reproducible example. I wonder if you have addressed the "gigantic" data issue. As it stands, the question leaves us guessing as to what the inner functions return, but in the event that they are large data frames, the main cost might be the copies made by the functions or in the `dplyr` pipe. Passing the data between parallel workers could be more expensive than the single-threaded code (see the sketch after these comments)... – SamR Nov 13 '22 at 17:02
  • Fair points, and just from the question it is not completely clear to me where the actual bottleneck is, so I don't know if or how much this approach improves the original problem. The way I understood the question was that 1) `dplyr` is already capable of handling the "gigantic" data, since the original code works; and 2) it asked to parallelize the functions Fun1-Fun6. I also think that is likely the only part of the code which CAN be parallelized/improved without knowing more details on the problem and the logic behind the joins. Curious to know if it ultimately improves the situation. – pholzm Nov 13 '22 at 17:32
  • @SamR so no, I did not address the "gigantic" part of the question. And there is a realistic chance that my answer does not improve the original problem, for the reasons you pointed out. – pholzm Nov 13 '22 at 17:37
  • I think the question is an XY problem. The real question is: how do I make my code execute more quickly? But the question asked was: how do I do this in parallel? Of course, without a reprex we don't know the bottlenecks, so the only possible answer is along the lines of the one you've posted. And actually your solution is a particularly clear and instructive example. But I am also curious to know whether it will ultimately speed up execution. – SamR Nov 13 '22 at 17:42
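
To make the data-transfer concern in these comments concrete, here is a rough sketch of how one might check the size of the objects shipped to the workers and, if necessary, raise future's export cap. The 2 GiB value is an arbitrary example, and Data1 stands in for whatever object is exported:

print(object.size(Data1), units = "Mb")       # size of one object exported to the workers

# future refuses to export globals above this limit (default is 500 MiB)
options(future.globals.maxSize = 2 * 1024^3)  # raise the cap to ~2 GiB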