I'm trying to use multidplyr to run a do command that calls a custom function, which queries a Vertica database using RJDBC. I can run the multidplyr examples and query the database directly without any problem, but when I try to connect to the database from within multidplyr I get the error:
Error in checkForRemoteErrors(lapply(cl, recvResult)) : 3 nodes produced errors; first error: No running JVM detected. Maybe .jinit() would help.
I've tried the suggestion in the comment here to make the cluster manually, passing the vertica database connection object, but I still get the "No running JVM detected" error. I'm guessing this is because each node needs to start its own JVM, but I don't know how to do that.
My code is below. Apologies that it's not reproducible; I can't share the database:
# set up DB connection
vertica <- dbConnect(vDriver, ...connection info...)
# create cluster
cluster3 <- create_cluster(3)
parallel::clusterExport(cl = cluster3, c("getData", "vertica"))
# run function in parallel using multidplyr
accounts_part <- multidplyr::partition(accounts, accountId, cluster = cluster3)
accounts_data <- accounts_part %>%
  group_by(accountId) %>%
  do(getData(ac = .$accountId, vertica = vertica))