
Recently I found out about the great dplyr.spark.hive package that enables dplyr frontend operations with a Spark or Hive backend.

There is information on how to install this package in the package's README:

options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")

and there are also many examples of how to work with dplyr.spark.hive once one is already connected to hiveServer2 - check this.
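
Once connected, the workflow is plain dplyr. A minimal sketch, assuming a working connection and a hypothetical table named logs:

library(dplyr.spark.hive)
my_db = src_SparkSQL()     # assumes a reachable thriftserver / hiveServer2
logs = tbl(my_db, "logs")  # "logs" is a made-up table name
logs %>%
  group_by(status) %>%
  summarise(n = n())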

But I am not able to connect to hiveServer2, so I cannot benefit from the great power of this package...

I've tried the commands below, but they did not work. Does anyone have a solution or a comment on what I am doing wrong?

> library(dplyr.spark.hive, 
+         lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
> 
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> 
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
> 
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+                      port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
> 
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() : 
  Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580.  Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1 
> 
> my_db = src_SparkSQL(start.server = TRUE,
+                      list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() : 
  Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580.  Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1 

EDIT 2

I have set more system environment variables, as shown below, but now I receive a warning telling me that some Java logging configuration is not specified, though I think it is:

> library(dplyr.spark.hive, 
+         lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
3: package ‘SparkR’ was built under R version 3.2.1 
> 
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
> 
> 
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

My log4j properties file is not empty:

-bash-4.2$ wc /etc/hadoop/log4j.properties 
 179  432 6581 /etc/hadoop/log4j.properties

EDIT 3

My exact call to src_SparkSQL() is

> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1 
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

And then the process never finishes. The same settings work for beeline with these parameters:

beeline  -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output

but I am not able to pass the user name and dbname to src_SparkSQL(), so I tried to manually run the code from inside that function. I hit the same problem: the code below also never finishes.

# Manual re-run of the internals of src_SparkSQL():
host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection") # "HiveConnection"
# The package wraps dbConnect in a retry loop:
# dbConnect_retry =
#   function(dr, url, retry){
#     if(retry > 0)
#       tryCatch(
#         dbConnect(drv = dr, url = url),
#         error =
#           function(e) {
#             Sys.sleep(0.1)
#             dbConnect_retry(dr = dr, url = url, retry - 1)})
#     else dbConnect(drv = dr, url = url)}
# con = new(con.class, dbConnect_retry(dr, url, retry = 100))
# A direct attempt hangs in the same way:
con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))

Maybe the url should also contain /loghost - the dbname?
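
That would amount to building the URL like this (just a guess at this point, untested):

url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")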

Marcin

2 Answers


I see now that you tried multiple things and got multiple errors. Let me comment error by error.

my_db = src_SparkSQL() Error in .jfindClass(as.character(driverClass)[1]) : class not found

The RJDBC object could not be created. Unless we solve this, nothing else will work, workarounds or not. Have you set HADOOP_JAR with, for instance, Sys.setenv(HADOOP_JAR = "../spark/assembly/target/scala-2.10/spark-assembly-1.5.0-hadoop2.6.0.jar")? Sorry, I seem to have skipped this in the instructions. Will fix.

my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl', + port = 10000) Error in .jfindClass(as.character(driverClass)[1]) : class not found

Same problem. Please note that the host and port arguments do not accept URL syntax, just a host name and a port number; the URL is formed internally.
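
So the call should look roughly like this (host and port taken from your own transcript):

my_db = src_SparkSQL(host = "tools-1.hadoop.srv", port = 10000)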

my_db = src_SparkSQL(start.server = TRUE) Error in start.server() : Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first. In addition: Warning message: running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1

Stop the thriftserver first or connect to the existing one, but you still have to fix the class-not-found problem.

my_db = src_SparkSQL(start.server = TRUE, + list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client")) Error in start.server() : Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first. In addition: Warning message: running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1

Same as above.

Plan:

  1. Set HADOOP_JAR. Find the host and port of the running thriftserver, if not default. Try src_SparkSQL with start.server = FALSE (see the sketch below). If happy, quit; else go to step 2.
  2. Stop the existing thriftserver. Try src_SparkSQL again with start.server = TRUE.
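
A minimal R version of that plan, assuming the jar path from your transcript and the default host/port:

Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(dplyr.spark.hive)
# step 1: connect to the already-running thriftserver
my_db = src_SparkSQL(start.server = FALSE)
# step 2 (only if step 1 fails): stop the running instance first, e.g. with
# /opt/spark-1.5.0-bin-hadoop2.4/sbin/stop-thriftserver.sh, then:
my_db = src_SparkSQL(start.server = TRUE)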

Let me know how things go.

piccolbo
  • Thanks for help @piccolbo! I am still having problems after I specified more paths in global variables. Could you kindly have a look at my EDIT 2? – Marcin Dec 15 '15 at 12:40
  • Please remove EDIT 1. It's your own fantasy. SparkR and the thriftserver are unrelated. This package uses the latter, orthogonal to the former. Instructions to start the thriftserver are on the spark site (google: spark thriftserver) or use start.server = TRUE as an argument to src_SparkSQL. The same site also explains how to set host and port. If you use env vars, this package will try to read them. Defaults are localhost and 10000. The log4j warning is innocuous and I haven't figured it out myself. Maybe we need to report it to the Spark folks. – piccolbo Dec 15 '15 at 16:34
  • Ok thanks, it looks like there is a long way ahead of me, but I'm excited to learn new stuff. – Marcin Dec 15 '15 at 17:24
  • It shouldn't be hard, but it's immature sw and it will take some iterations. The simplest, if you have spark installed, is /sbin/start-thriftserver.sh. That starts it at the defaults unless you have certain vars set (see https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server). Then in R: mysrc = src_SparkSQL(). That should do it for a start. – piccolbo Dec 15 '15 at 19:38
  • I have verified that the HADOOP_JAR configuration issue is already covered in the README. – piccolbo Dec 15 '15 at 19:42
  • In the README of what? By the way, what if I cannot restart the thriftserver and cannot start a new thriftserver? I have a HiveServer2 that cannot be restarted. I am passing the same host and port to src_SparkSQL() as I am passing to beeline, but src_SparkSQL() doesn't seem to want to finish working. – Marcin Dec 30 '15 at 13:52
  • README of dplyr.spark.hive; if hiveserver2 is up and running, that's great, and it's also great that you tested host and port with beeline. You are doing the right things. This may be a bug. Please paste here the exact src_SparkSQL call. – piccolbo Dec 30 '15 at 18:03
  • I've edited my question with the exact call and the methods I have tried. Thanks for the consultation. – Marcin Dec 30 '15 at 19:28

The problem was that I didn't specify the proper classPath needed inside the JDBC() function that creates the driver. The classPath parameter in the dplyr.spark.hive package is passed via the HADOOP_JAR environment variable.

To use JDBC as a driver to hiveServer2 (through the Thrift protocol), one needs to add at least these 3 .jars with Java classes to create a proper driver:

  • hive-jdbc-1.0.0-standalone.jar
  • hadoop/common/lib/commons-configuration-1.6.jar
  • hadoop/common/hadoop-common-2.4.1.jar

The versions are arbitrary; they just have to be compatible with the installed versions of the local Hive, Hadoop and hiveServer2.
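
If you are unsure where these jars live on a given machine, R can search for them; the roots below are just typical guesses, adjust to your system:

list.files("/opt/hive/lib", pattern = "hive-jdbc-.*-standalone\\.jar$", full.names = TRUE)
list.files("/usr/share/hadoop", pattern = "(hadoop-common|commons-configuration).*\\.jar$",
           recursive = TRUE, full.names = TRUE)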

They need to be concatenated with .Platform$path.sep (as described here):

classPath = c("system_path1_to_hive/hive/lib/hive-jdbc-1.0.0-standalone.jar",
              "system_path1_to_hadoop/hadoop/common/lib/commons-configuration-1.6.jar",
              "system_path1_to_hadoop/hadoop/common/hadoop-common-2.4.1.jar")
Sys.setenv(HADOOP_JAR = paste0(classPath, collapse = .Platform$path.sep))

Then, when HADOOP_JAR is set, one has to be careful with the hiveServer2 URL. In my case it had to be

host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")

and finally the proper connection to hiveServer2 using the RJDBC package is

Sys.setenv(HADOOP_HOME = "/usr/share/hadoop/share/hadoop/common/")
Sys.setenv(HIVE_HOME = '/opt/hive/lib/')
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
driverclass = "org.apache.hive.jdbc.HiveDriver"
library(RJDBC)
.jinit()
# the classPath lists the three jars described above
dr2 = JDBC(driverclass,
           classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
                         #"/opt/hive/lib/commons-configuration-1.6.jar",
                         "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
                         "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar"),
           identifier.quote = "`")
cont = dbConnect(dr2, url, username = "mkosinski")
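
Once cont is established, any lightweight query serves as a sanity check, e.g.:

dbGetQuery(cont, "show tables")
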
Marcin