I have several functions that are executed sequentially, that is, one runs, then the next, and so on. Because of the computation time, I would like to parallelize these functions in R. Searching the forum, I found some information in the following topic: Run several R functions in parallel.
However, I confess that I am a beginner in this area of data science and have never done any parallel programming. My use case is that I am joining several data sets via the left_join() function, and for these joins I apply six different functions. After that, I create some variables and store everything in a data.frame. See the following routine:
library(dplyr)      # %>%, left_join(), select(), mutate(), contains()
library(lubridate)  # year()

Datajoin <- function(Data1,
                     Data2,
                     Data3){
  # Join the outputs of the six functions
  fnctions <- Fun1(Data1, Data3) %>%
    left_join(Fun2(Data1)) %>%
    left_join(Fun3(Data1, Data2, Data3)) %>%
    left_join(Fun4(Data1)) %>%
    left_join(Fun5(Data1, Data3), by = c('St_ab_Data1' = 'St_ab')) %>%
    left_join(Fun6(Data1, Data3), by = c('St_ab_Data1' = 'St_ab')) %>%
    # Append dummy (indicator) columns for Var2
    cbind(Data2 %>%
            select(St_ab, Var2) %>%
            mutate(Var2 = factor(Var2, levels = unique(Data3$Var2))) %>%
            {model.matrix(~ -1 + .$Var2) %>%
                as.data.frame}) %>%
    # Add the selected Data2 columns and the year of Var3
    left_join(Data2 %>%
                as.data.frame %>%
                select(St_ab, Var3,
                       contains("VarA"),
                       contains("VarB"),
                       contains("VarC"),
                       contains("VarD")) %>%
                mutate(Var4 = year(Var3)), by = c('St_ab_Data1' = 'St_ab'))
  names(fnctions)[names(fnctions) == "St_ab_Data1"] <- 'St_id'
  return(fnctions)
}
As I already mentioned, my main problem is computation time: the data sets are very large, so this process is slow. Given the structure above, how could I parallelize the work of these functions?
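To make the idea concrete, here is a minimal sketch of one possible direction, assuming the function calls do not depend on one another's results: each call runs as a separate task on a worker, and the results are joined sequentially afterwards. FunA, FunB, FunC and the toy data are hypothetical stand-ins (the real Fun1 to Fun6 are not shown), and base R's merge(all.x = TRUE) is used in place of left_join() only to keep the sketch free of extra packages.

```r
library(parallel)

# Toy inputs; the real Data1 and Data3 are assumed to be much larger.
Data1 <- data.frame(St_ab = 1:4, x = c(10, 20, 30, 40))
Data3 <- data.frame(St_ab = 1:4, y = c(1, 2, 3, 4))

# Hypothetical stand-ins for the unshared Fun1, Fun2, ...
FunA <- function(Data1, Data3) data.frame(St_ab = Data1$St_ab, va = Data1$x + Data3$y)
FunB <- function(Data1) data.frame(St_ab = Data1$St_ab, vb = Data1$x * 2)
FunC <- function(Data1) data.frame(St_ab = Data1$St_ab, vc = sqrt(Data1$x))

# One zero-argument task per independent function call.
tasks <- list(
  function() FunA(Data1, Data3),
  function() FunB(Data1),
  function() FunC(Data1)
)

clus <- makeCluster(2)
# Workers start with an empty workspace: export every object the tasks use.
clusterExport(clus, c("FunA", "FunB", "FunC", "Data1", "Data3"),
              envir = environment())
# The third argument is the function that actually runs each task on a worker.
pieces <- clusterApply(clus, tasks, function(f) f())
stopCluster(clus)

# Join the pieces sequentially; merge(all.x = TRUE) is base R's left join.
result <- Reduce(function(x, y) merge(x, y, by = "St_ab", all.x = TRUE), pieces)
print(result)
```

Whether this pays off depends on how long each function runs compared with the cost of copying the data to every worker, so with very large data sets it is worth timing both versions.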
I tried to do it as follows, based on the post linked at the beginning of this question:
Datajoin <- function(Data1,
                     Data2,
                     Data3){
  fnctions <- Fun1(Data1, Data3) %>%
    left_join(Fun2(Data1)) %>%
    left_join(Fun3(Data1, Data2, Data3)) %>%
    left_join(Fun4(Data1)) %>%
    left_join(Fun5(Data1, Data3), by = c('St_ab_Data1' = 'St_ab')) %>%
    left_join(Fun6(Data1, Data3), by = c('St_ab_Data1' = 'St_ab')) %>%
    cbind(Data2 %>%
            select(St_ab, Var2) %>%
            mutate(Var2 = factor(Var2, levels = unique(Data3$Var2))) %>%
            {model.matrix(~ -1 + .$Var2) %>%
                as.data.frame}) %>%
    left_join(Data2 %>%
                as.data.frame %>%
                select(St_ab, Var3,
                       contains("VarA"),
                       contains("VarB"),
                       contains("VarC"),
                       contains("VarD")) %>%
                mutate(Var4 = year(Var3)), by = c('St_ab_Data1' = 'St_ab'))
  names(fnctions)[names(fnctions) == "St_ab_Data1"] <- 'St_id'
  return(fnctions)
}
tasks1 = list(wrk1 = function(x) Datajoin(x))
library(parallel)
clus = makeCluster(6)
clusterExport(clus, c('Datajoin',
                      'Data1', 'Data2', 'Data3'))
outPUT = clusterApply(
  clus,
  tasks1
)
stopCluster(clus)
However, although no error is raised, it does not return the desired result.
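For comparison, here is a minimal sketch of how a call of this shape can be made to return a result, under a few assumptions. clusterApply() takes a third argument, the function applied to each list element, and Datajoin() takes three arguments, so the task below takes none and passes all three data sets explicitly. Datajoin_demo and the toy data are hypothetical stand-ins for the real function and inputs.

```r
library(parallel)

# Toy inputs and a hypothetical stand-in for the real three-argument Datajoin().
Data1 <- data.frame(St_ab = 1:3, v1 = c(1, 2, 3))
Data2 <- data.frame(St_ab = 1:3, v2 = c(10, 20, 30))
Data3 <- data.frame(St_ab = 1:3, v3 = c(100, 200, 300))
Datajoin_demo <- function(Data1, Data2, Data3) {
  Reduce(function(x, y) merge(x, y, by = "St_ab", all.x = TRUE),
         list(Data1, Data2, Data3))
}

# The task takes no arguments and passes all three data sets explicitly.
tasks1 <- list(wrk1 = function() Datajoin_demo(Data1, Data2, Data3))

clus <- makeCluster(1)
# With the real code, the packages would also have to be loaded on each worker,
# e.g. clusterEvalQ(clus, { library(dplyr); library(lubridate) }),
# and Fun1 to Fun6 would have to be exported alongside Datajoin.
clusterExport(clus, c("Datajoin_demo", "Data1", "Data2", "Data3"),
              envir = environment())
# clusterApply() needs a function saying what to do with each list element.
outPUT <- clusterApply(clus, tasks1, function(f) f())
stopCluster(clus)

str(outPUT[[1]])
```

Note that a list with a single task keeps only one worker busy, so this form on its own would not be expected to run faster than calling the function directly.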
Note: Fun1 to Fun6 are my own functions, but I can't share them in this topic. I tried to make the post reproducible by substituting base R functions, but I ran into a series of errors. I apologize in advance.