I'm trying to use multidplyr to run a do() call that applies a custom function which queries a Vertica database via RJDBC. I have no problem running the multidplyr examples or querying the database directly, but when I try to connect to the database from within multidplyr I get the error:

Error in checkForRemoteErrors(lapply(cl, recvResult)) : 3 nodes produced errors; first error: No running JVM detected. Maybe .jinit() would help.

I've tried the suggestion in the comment here to make the cluster manually, passing the Vertica database connection object, but I still get the "No running JVM detected" error. I'm guessing this is because I need to tell each node to start up a JVM, but I don't know how to do this.

My code, apologies that it's not reproducible as I can't share the database:

# set up DB connection
vertica <- dbConnect(vDriver, ...connection info...)

# create cluster
cluster3 <- create_cluster(3)
parallel::clusterExport(cl = cluster3, c("getData", "vertica"))

# run function in parallel using multidplyr
accounts_part <- multidplyr::partition(accounts, accountId, cluster = cluster3)

accounts_data <- accounts_part %>% 
  group_by(accountId) %>%
  do(getData(ac = .$accountId, vertica = vertica))

1 Answer

I cannot reproduce your code, so my answer is based on my experience (I had a similar issue assigning a different random number to each cluster) and on both this and this. Hope that it works.

I think your main issue is that your connection links your db to the current session, so you cannot "copy" it onto each node: every node needs its own "way" to the db.

On the other hand, you still need each connection to have the same name so that you can manage them all in a single (parallel) call. So I think you only need to create the connections directly on each node, instead of creating one connection in the current session and then copying it onto the nodes.
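To see the difference between copying a local object and evaluating on each node, here is a minimal sketch using only base parallel (no database involved; the value of x and the cluster size are arbitrary):

```r
library(parallel)

cl <- makeCluster(2)

# clusterExport() copies ONE object from the current session to every node,
# so all nodes see the identical value:
x <- 42
clusterExport(cl, "x")
exported <- unlist(clusterCall(cl, function() x))

# clusterEvalQ() evaluates the expression ON each node, so every node
# builds its own object (here: its own process id, different per worker):
pids <- unlist(clusterEvalQ(cl, Sys.getpid()))

stopCluster(cl)

exported  # the same value on both nodes
pids      # two different values, one per worker process
```

A db connection behaves like the second case: it is per-process state that must be created on the worker, not copied into it.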

# create cluster
cluster3 <- create_cluster(3)

# export data, function, ...
cluster_assign_value(cluster3, 'getData', getData)

# set up DB connection on each node (evaluated on each node)
# (note: vDriver must also exist on each node, so create it there too)
cluster_eval(cluster3, vertica <- dbConnect(vDriver, ...connection info...))
# not
# cluster_assign_value(cluster3,
#     'vertica', dbConnect(vDriver, ...connection info...)
# )
# which is evaluated locally and so is the same as your code

# run function in parallel using multidplyr
accounts_part <- multidplyr::partition(accounts, accountId,
    cluster = cluster3
)

accounts_data <- accounts_part %>% 
    # group_by(accountId) %>% # partition() has already grouped them
    do(getData(ac = .$accountId, vertica = vertica)) # %>%
    # collect() # if you have no more computations to do and want to get
                # the results back locally
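For the JVM error specifically: RJDBC runs on top of rJava, so each worker process needs its own running JVM, and loading RJDBC on the node takes care of that. A hedged sketch of the per-node setup using plain parallel; the driver class name and jar path are placeholders for your Vertica install, and the dbConnect() arguments are left elided as in your code:

```r
library(parallel)

cl <- makeCluster(3)

clusterEvalQ(cl, {
  # loading RJDBC on the worker initializes a node-local JVM
  # (the same effect as calling rJava::.jinit() on that worker)
  library(RJDBC)

  # placeholder driver class / jar path -- adjust to your installation
  vDriver <- JDBC(driverClass = "com.vertica.jdbc.Driver",
                  classPath   = "/path/to/vertica-jdbc.jar")

  vertica <- dbConnect(vDriver, ...connection info...)
})

# remember to close the per-node connections when done:
# clusterEvalQ(cl, dbDisconnect(vertica))
stopCluster(cl)
```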