
I'm working on adding HiveServer2 support to my company's R data-access package. I'm curious what the best way of generating an R Thrift client would be. I'm considering writing an R wrapper around the Java Thrift client, similar to what rhbase does, but I'd prefer a pure R solution, if possible.

Things to note:

  • The HiveServer2 Thrift server is different from the original Hive Thrift server.
  • I've looked at and used the RHive package. Among other issues I have with it, it requires a system-install of Hadoop and Hive, which will not always be available on R client machines.
  • My somewhat horrible, but currently sufficient, workaround is to wrap the beeline client in some R goodness; a rough sketch of that approach follows this list.
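Roughly, that wrapper looks like the sketch below. It assumes beeline is on the PATH; the URL, database, and table names are placeholders.

# Shell out to beeline, request CSV output, and parse the result back into
# a data.frame. Assumes beeline is on the PATH; URL and query are placeholders.
hive_query <- function(sql, url = "jdbc:hive2://localhost:10000/default") {
  out <- system2(
    "beeline",
    args = c("-u", shQuote(url), "-e", shQuote(sql),
             "--outputformat=csv2", "--silent=true"),
    stdout = TRUE
  )
  read.csv(text = paste(out, collapse = "\n"), stringsAsFactors = FALSE)
}

df <- hive_query("SELECT * FROM some_db.some_table LIMIT 10")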
yoni
  • I wrote a dplyr backend for Spark and Hive. In both cases it uses RJDBC to connect to the HS2 Thrift server. RJDBC needs some jars from the Spark or Hadoop distros on some path. So I am kind of under the impression that if you connect to a JDBC interface, that's what you have to have, but I've never read a definitive statement supporting that. I will research more. I share your concern, but I don't understand how packaging beeline instead of a jar in a package makes anything better. Who installs beeline? It's still an external dependency and makes installs hard. – piccolbo Nov 03 '15 at 01:40
  • Just to clarify, the following two options are not sufficient for your needs? 1. Install R on an (edge) node in your cluster. 2. Pull data via JDBC from outside the cluster. – Dennis Jaheruddin Aug 01 '17 at 08:13
  • Thanks, Dennis. First, note that this question is a bit old; I'm no longer trying to solve this problem. Still, the #1 option you mention is certainly possible (and I've used it to solve other problems), but not relevant to this question. The point here was to run and retrieve results of a query, not to run R as part of the query itself. Your #2 option would certainly be another way of going about solving the question posed. That's probably what I'd recommend to anyone trying to create an R Hive client. Thank you for highlighting that option. – yoni Aug 03 '17 at 05:48

1 Answer


The exact scope of this question may be too broad for Stack Overflow, and the asker has confirmed he abandoned this quest, but for future readers this is probably the thing to look for:

From R you can connect to Hive with JDBC.

This is not exactly what the asker came for, but it should serve the purpose in most cases.


The key part of the solution is the RJDBC package; here is some example code found on the Cloudera Community:

library(DBI)
library(rJava)
library(RJDBC)

# Collect every Hadoop/Hive jar from the install. The paths are specific to
# this HDP 2.4 cluster; adjust them for your distribution. Note that
# hadoop.lib.path repeats hive.class.path here, which is redundant but harmless.
hadoop.class.path = list.files(path = "/usr/hdp/2.4.0.0-169/hadoop", pattern = "jar", full.names = TRUE)
hive.class.path = list.files(path = "/usr/hdp/current/hive-client/lib", pattern = "jar", full.names = TRUE)
hadoop.lib.path = list.files(path = "/usr/hdp/current/hive-client/lib", pattern = "jar", full.names = TRUE)
mapred.class.path = list.files(path = "/usr/hdp/current/hadoop-mapreduce-client/lib", pattern = "jar", full.names = TRUE)
cp = c(hive.class.path, hadoop.lib.path, mapred.class.path, hadoop.class.path)

# Hand the collected jars to the driver: the snippet as originally posted
# built cp but never used it, so the driver class could not be found.
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", classPath = cp, identifier.quote = "`")
conn <- dbConnect(drv, "jdbc:hive2://ixxx:10000/default", "hive", "hive")
show_databases <- dbGetQuery(conn, "show databases")
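Once the connection is established, the usual DBI verbs work against Hive. For example (the table name below is a placeholder):

# Standard DBI calls over the HiveServer2 connection; the table name is
# illustrative only.
dbListTables(conn)
head(dbGetQuery(conn, "SELECT * FROM default.some_table LIMIT 10"))
dbDisconnect(conn)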

Full disclosure: I am an employee of Cloudera.

Dennis Jaheruddin