Using JDBC Driver for Hive on Amazon EMR in R

Question

Amazon provides a JDBC Driver for connecting to Hive. There is a JDBC package for R that seems like it should be able to make use of those drivers and allow access to Hive. In fact, others have clearly been able to access Hive from R with JDBC.

I have not been so lucky. I have successfully used the Amazon drivers via SQL Workbench/J following the instructions provided by Amazon to pull some data from Hive. Therefore, I know that in practice the .jars included in the Amazon JDBC driver are sufficient for connecting to the version of Hive (0.13.1) on my EMR cluster, and I also know that that the SSH tunnel has been set up correctly.

It seems like the following R code should suffice for connecting to Hive on the EMR cluster (following similar configuration options as for SQL Workbench/J and cribbing some notes from the "Using Hive from R with JDBC" blog post"):

if (!require("DBI")) {install.packages("DBI");library("DBI")}
if (!require("rJava")) {install.packages("rJava");library("rJava")}
if (!require("RJDBC")) {install.packages("RJDBC",dep=TRUE);library("RJDBC")}
JDBCLib <- "~/AmazonHiveJDBC/1.0.0.1000/"
for(l in list.files(JDBCLib)){ .jaddClassPath(paste(JDBCLib,l,sep=""))}
drv <- JDBC("com.amazon.hive.jdbc3.HS2Driver", paste0(JDBCLib,"HiveJDBC3.jar"))
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default", user="hadoop")

The JDBC function returns the drv object without complaint that has a structure that looks like the following:

Formal class 'JDBCDriver' [package "RJDBC"] with 2 slots
  ..@ identifier.quote: chr NA
  ..@ jdrv            :Formal class 'jobjRef' [package "rJava"] with 2 slots
  .. .. ..@ jobj  :<externalptr> 
  .. .. ..@ jclass: chr "com/amazon/hive/jdbc3/HS2Driver"

However, try as I might, dbConnect returns with

Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1],  : 
  java.lang.NoClassDefFoundError: Could not initialize class com.amazon.hive.hive.core.Hive2JDBCDriver

I vain I have tried specifying the user in the URL, specifying the dbname, not specifying the user at all, etc. I have also tried a couple variants of the URL as suggested here, e.g. "jdbc:hive2://localhost\default:10000;user=hadoop", but didn't have any success. At this point it seems like I am tantalizingly close to success, but not quite getting there. What error have I made? How can I correct it?

score 0 · Accepted Answer · edited May 23 '17 at 10:28

0

Another question, which has received no answer, included .jinit() in the example code that they had used to successfully connect to hive. This was the missing special sauce. .jinit() launches the JVM and must be called prior to adding the jars to the class path.

edited May 23 '17 at 10:28

Community

1
1

answered Feb 20 '15 at 17:46

russellpierce

4,583
2
32
44

Using JDBC Driver for Hive on Amazon EMR in R

1 Answers1