Amazon provides a JDBC Driver for connecting to Hive. There is a JDBC package for R that seems like it should be able to make use of those drivers and allow access to Hive. In fact, others have clearly been able to access Hive from R with JDBC.
I have not been so lucky. I have successfully used the Amazon drivers via SQL Workbench/J following the instructions provided by Amazon to pull some data from Hive. Therefore, I know that in practice the .jars included in the Amazon JDBC driver are sufficient for connecting to the version of Hive (0.13.1) on my EMR cluster, and I also know that that the SSH tunnel has been set up correctly.
It seems like the following R code should suffice for connecting to Hive on the EMR cluster (following similar configuration options as for SQL Workbench/J and cribbing some notes from the "Using Hive from R with JDBC" blog post"):
if (!require("DBI")) {install.packages("DBI");library("DBI")}
if (!require("rJava")) {install.packages("rJava");library("rJava")}
if (!require("RJDBC")) {install.packages("RJDBC",dep=TRUE);library("RJDBC")}
JDBCLib <- "~/AmazonHiveJDBC/1.0.0.1000/"
for(l in list.files(JDBCLib)){ .jaddClassPath(paste(JDBCLib,l,sep=""))}
drv <- JDBC("com.amazon.hive.jdbc3.HS2Driver", paste0(JDBCLib,"HiveJDBC3.jar"))
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default", user="hadoop")
The JDBC
function returns the drv object without complaint that has a structure that looks like the following:
Formal class 'JDBCDriver' [package "RJDBC"] with 2 slots
..@ identifier.quote: chr NA
..@ jdrv :Formal class 'jobjRef' [package "rJava"] with 2 slots
.. .. ..@ jobj :<externalptr>
.. .. ..@ jclass: chr "com/amazon/hive/jdbc3/HS2Driver"
However, try as I might, dbConnect
returns with
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: Could not initialize class com.amazon.hive.hive.core.Hive2JDBCDriver
I vain I have tried specifying the user in the URL, specifying the dbname, not specifying the user at all, etc. I have also tried a couple variants of the URL as suggested here, e.g. "jdbc:hive2://localhost\default:10000;user=hadoop", but didn't have any success. At this point it seems like I am tantalizingly close to success, but not quite getting there. What error have I made? How can I correct it?