I have a dataframe df
with an id
column. This maps to many rows 1:n) in my database table. Querying each ID sequentially takes about an hour to complete, so I'm trying to run multiple queries at once using the doparallel package. There is overhead with handling job querying, so I've broken down each query to handle 'chunks' of rows at a time.
I have a 4 core, 8 thread computer. Sequentially (running one query at a time w/o parallelization)- the task takes 52 minutes to complete. In parallel, it takes 38 minutes to complete. This is not a great improvement, and I'm looking for means to speed this up (highly likely I'm doing something wrong).
Code for sequntial query:
conn <- dbConnect(driver, params)
for(i in 1:nrow(df)){
df_output = rbind(df_output , dbGetQuery(conn, paste0("
SELECT
id, DateTime, a, b,
FROM table
WHERE id = '",df$id[i],"'
")))
}
In parallel:
cl <- makePSOCKcluster(max(1, detectCores() - 1))
registerDoParallel(cl)
#Make DBConnection per worker
clusterEvalQ(cl, {
#libraries go here
Sys.setenv(TZ='GMT')
driver <- dbDriver("PostgreSQL")
conn <- dbConnect(driver, params)
NULL
})
mcoptions <- list( preschedule=FALSE,
mc.cores = max(1,detectCores()-1) )
df_output = foreach(i=isplitVector(1:NROW(df), chunks=99),
.verbose = TRUE,
.multicombine = TRUE,
.inorder=FALSE,
.combine=rbind,
.options.multicore=mcoptions,
.noexport="conn",
.packages=c("RJDBC", "RPostgreSQL")) %dopar% {
df_output = setNames(data.frame(matrix(ncol = 4, nrow = 0)), c("id", "dateTime", 'a', 'b'))
#Query the database for each set of j elements for the vector i
for(i in 1:nrow(df)){
df_output = rbind(df_output , dbGetQuery(conn, paste0("
SELECT
id, DateTime, a, b,
FROM table
WHERE id = '",df$id[i],"'
")))
}
return(df_output )
}
#Make the workers disconnect from the database
clusterEvalQ(cl, {
dbDisconnect(conn)
})
#Kill the workers
stopCluster(cl)
stopImplicitCluster()
Edit- Output from verbose:
discovered package(s):
automatically exporting the following variables from the local environment:
df
explicitly exporting package(s): RJDBC, RPostgreSQL
numValues: 99, numResults: 0, stopped: TRUE
got results for task 1
numValues: 99, numResults: 1, stopped: TRUE
returning status FALSE
got results for task 2
numValues: 99, numResults: 2, stopped: TRUE
returning status FALSE
got results for task 3
numValues: 99, numResults: 3, stopped: TRUE
returning status FALSE
got results for task 4
numValues: 99, numResults: X, stopped: TRUE
returning status FALSE
got results for task X+1
...
first call to combine function
evaluating call object to combine results:
fun(result.1, ... result.99)
returning status TRUE