
I have a dataframe `df` with an `id` column. Each id maps to many rows (1:n) in my database table. Querying each id sequentially takes about an hour to complete, so I'm trying to run multiple queries at once using the doParallel package. There is overhead in dispatching each query, so I've broken the work into 'chunks' of ids, with each parallel task handling one chunk at a time.

I have a 4-core, 8-thread computer. Sequentially (running one query at a time without parallelization), the task takes 52 minutes to complete. In parallel, it takes 38 minutes. That is not a great improvement, and I'm looking for ways to speed it up (it's highly likely I'm doing something wrong).

Code for the sequential query:

conn <- dbConnect(driver, params)

df_output <- NULL
for (i in 1:nrow(df)) {
  df_output <- rbind(df_output, dbGetQuery(conn, paste0("
    SELECT id, DateTime, a, b
    FROM table
    WHERE id = '", df$id[i], "'
  ")))
}
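
As an aside on the string-pasting (see r2evans's comment below about `DBI::dbBind` / parameterized queries): a minimal sketch of the same sequential loop with the id bound as a query parameter instead of pasted into the SQL text. The RPostgres driver and connection arguments here are assumptions, and the table/column names are the same placeholders as above.

library(DBI)
library(RPostgres)   # assumed driver; supports $1-style bound parameters

conn <- dbConnect(RPostgres::Postgres(), dbname = "mydb")   # placeholder connection params

df_output <- NULL
for (i in 1:nrow(df)) {
  # The id value is sent as a bound parameter, never interpolated into the SQL
  res <- dbGetQuery(conn,
                    "SELECT id, DateTime, a, b FROM table WHERE id = $1",
                    params = list(df$id[i]))
  df_output <- rbind(df_output, res)
}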

In parallel:

library(doParallel)
library(itertools)   # for isplitVector()

cl <- makePSOCKcluster(max(1, detectCores() - 1))
registerDoParallel(cl)

#Make DBConnection per worker
clusterEvalQ(cl, {
  #libraries go here
  Sys.setenv(TZ='GMT')
  driver <- dbDriver("PostgreSQL")
  conn <- dbConnect(driver, params)
  NULL
})

mcoptions <- list(  preschedule=FALSE, 
                    mc.cores = max(1,detectCores()-1)  )

df_output <- foreach(i = isplitVector(1:NROW(df), chunks = 99),
        .verbose = TRUE,
        .multicombine = TRUE,
        .inorder = FALSE,
        .combine = rbind,
        .options.multicore = mcoptions,
        .noexport = "conn",
        .packages = c("RJDBC", "RPostgreSQL")) %dopar% {

  chunk_output <- setNames(data.frame(matrix(ncol = 4, nrow = 0)),
                           c("id", "dateTime", "a", "b"))

  # Query the database for each element j of the chunk vector i
  for (j in i) {
    chunk_output <- rbind(chunk_output, dbGetQuery(conn, paste0("
      SELECT id, DateTime, a, b
      FROM table
      WHERE id = '", df$id[j], "'
    ")))
  }
  chunk_output
}

#Make the workers disconnect from the database
clusterEvalQ(cl, {
  dbDisconnect(conn)
})

#Kill the workers
stopCluster(cl)
stopImplicitCluster()
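
Building on the chunking and the `where id in (...)` idea from the comments, one variant worth trying is a single query per chunk instead of one query per id, so the round-trip overhead is paid once per chunk (~99 times) rather than once per row. A rough sketch, assuming the same cluster and per-worker conn as above; it still pastes values into the SQL, so the injection caveat from the comments applies:

df_output <- foreach(idx = isplitVector(1:NROW(df), chunks = 99),
                     .inorder = FALSE,
                     .multicombine = TRUE,
                     .combine = rbind,
                     .noexport = "conn",
                     .packages = "RPostgreSQL") %dopar% {
  # Build one IN (...) list from all ids in this chunk
  ids <- paste0("'", df$id[idx], "'", collapse = ", ")
  dbGetQuery(conn, paste0(
    "SELECT id, DateTime, a, b FROM table WHERE id IN (", ids, ")"
  ))
}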

Edit: output from the verbose run:

discovered package(s): 
automatically exporting the following variables from the local environment:
  df 
explicitly exporting package(s): RJDBC, RPostgreSQL
numValues: 99, numResults: 0, stopped: TRUE
got results for task 1
numValues: 99, numResults: 1, stopped: TRUE
returning status FALSE
got results for task 2
numValues: 99, numResults: 2, stopped: TRUE
returning status FALSE
got results for task 3
numValues: 99, numResults: 3, stopped: TRUE
returning status FALSE
got results for task 4
numValues: 99, numResults: X, stopped: TRUE
returning status FALSE
got results for task X+1
...
first call to combine function
evaluating call object to combine results:
  fun(result.1, ... result.99)
returning status TRUE
  • (1) I understand that you are giving sample code, but please provide *working* sample code: your double-quotes are nested/escaped incorrectly. (2) You should really not use `paste` to put *values* in a query, guarding against SQL injection (https://xkcd.com/327/). Better to use `DBI::dbBind` or at least `glue::glue_sql`. – r2evans Aug 20 '19 at 00:10
  • (3) Does each individual query (on its own) take an hour to *process* or an hour to *download*? If the latter, then parallelizing it really won't help much. If the former, perhaps it would be better to look at the query itself (perhaps more appropriate on https://dba.stackexchange.com/). – r2evans Aug 20 '19 at 00:11
  • @r2evans I just fixed the double quote issue, my apologies! I'll look into dbBind and change my code accordingly. Each query takes about 0.5-3 seconds to complete. – CorerMaximus Aug 20 '19 at 00:15
  • I have found that "normal" download speeds can be reduced by using the `psql` tool to a local file. From there, running one query per id can also be problematic ... it might be better to insert all IDs into a temporary table and change your query to be `select id,datetime,a,b from table where id in (select id from temptable)` (though a join-op might be equally effective). – r2evans Aug 20 '19 at 00:15
  • I haven't used `RJDBC`, so I don't know, but do you get "good performance" out of it? I've used `RPostgres` and `odbc`, and native (the former) is just about "always" faster in both read and write. – r2evans Aug 20 '19 at 00:18
  • The actual `df` with the `id` column has a `start` and `end` column. The actual query I'm running makes use of that start/end, and I'm unsure how to filter on all 3 parameters if using the `where var in (x, y, z)` structure, so am querying it one at a time. – CorerMaximus Aug 20 '19 at 00:22
  • In terms of performance, the query above takes about the same time to execute as executing it on Datagrip. The query is evaluated and executed by the database as opposed to R, so the only performance hit would be in sending and retrieving the query to the DB. – CorerMaximus Aug 20 '19 at 00:23
  • Performance in the DBMS is obviously (and "always") going to be better than similar performance in R, sure. You are right, using `where id in (...)` will only work with single fields, but using the "join" technique can be accomplished with inequalities. (I admit I'm not a pro on inequality-join performance on DBMS, but if the query is structured right, I imagine it could be fast.) Are you certain that multiplying the bandwidth can be accommodated by both ends of the network pipe? I'm not doubting it's possible, but often it's not the bottleneck. – r2evans Aug 20 '19 at 00:40
  • I had not considered the network bandwidth as a possible issue. I can retry it with a wired connection and report back in a couple of hours. It could be a limitation in the number of concurrent connections or queries. – CorerMaximus Aug 20 '19 at 00:52
  • 1
    Not to say I've tested it extensively, but I've never found "parallelized R connections" to outperform a single bulk-copy. If you really need to get a lot of data as quickly as possible, consider using `psql` directly, and parallelize that outside of R. – r2evans Aug 20 '19 at 00:59
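
A minimal sketch of the temp-table / join approach suggested in the comments, assuming the real `df` also carries `start` and `end` columns and that the table's time column is `DateTime` (all table/column names are placeholders; `temporary = TRUE` exists in `RPostgres::dbWriteTable`, with `RPostgreSQL` you would create and drop an ordinary table instead):

# Upload the id/start/end lookup once, then let the database do the filtering
dbWriteTable(conn, "id_windows", df[, c("id", "start", "end")],
             temporary = TRUE, overwrite = TRUE)

# One query; the inequality join replaces the per-id WHERE clause
df_output <- dbGetQuery(conn, '
  SELECT t.id, t.DateTime, t.a, t.b
  FROM table t
  JOIN id_windows w
    ON  t.id = w.id
    AND t.DateTime BETWEEN w."start" AND w."end"
')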
