I'm trying to make a temporary table I create in PySpark available via Thrift. My final goal is to be able to access it from a database client like DBeaver using JDBC.
I'm testing first using beeline.
This is what I'm doing:
- Started a cluster with one worker on my own machine using Docker, and added

```
spark.sql.hive.thriftServer.singleSession true
```

to spark-defaults.conf.
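For reference, I believe the same option can also be set per application when building the session in code instead of via spark-defaults.conf; a minimal sketch of what I mean (the app name is just a placeholder):

```python
from pyspark.sql import SparkSession

# Per-application alternative to the spark-defaults.conf entry above.
# "thrift-test" is only a placeholder app name.
spark = (SparkSession.builder
         .appName("thrift-test")
         .config("spark.sql.hive.thriftServer.singleSession", "true")
         .enableHiveSupport()
         .getOrCreate())
```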
Started a PySpark shell (for testing's sake) and ran the following code:
```python
from pyspark.sql import Row

l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
people = people.toDF().cache()
people.createOrReplaceTempView('peebs')  # returns None, so no need to assign it
result = sqlContext.sql('select * from peebs')
```
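Inside that same shell the view checks out; this is roughly how I verify it (spark.catalog.listTables() should report peebs as a temporary table):

```python
# Sanity check inside the same PySpark shell: the temp view is visible here.
result.show()
print(spark.catalog.listTables())  # 'peebs' appears with tableType='TEMPORARY'
```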
So far so good, everything works fine.
In a different terminal I start the Spark Thrift Server:
```
./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --conf spark.executor.cores=1 --master spark://172.18.0.2:7077
```
The server appears to start normally, and I can see both the PySpark and Thrift Server applications running in my Spark cluster's master UI.
I then connect to the cluster using beeline:

```
./bin/beeline
beeline> !connect jdbc:hive2://172.18.0.2:10001
```
This is what I got:

```
Connecting to jdbc:hive2://172.18.0.2:10001
Enter username for jdbc:hive2://172.18.0.2:10001:
Enter password for jdbc:hive2://172.18.0.2:10001:
2019-06-29 20:14:25 INFO Utils:310 - Supplied authorities: 172.18.0.2:10001
2019-06-29 20:14:25 INFO Utils:397 - Resolved authority: 172.18.0.2:10001
2019-06-29 20:14:25 INFO HiveConnection:203 - Will try to open client transport with JDBC Uri: jdbc:hive2://172.18.0.2:10001
Connected to: Spark SQL (version 2.3.3)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
```

Seems to be OK.
But when I run

```
show tables;
```

nothing is listed.
Two interesting things I'd like to highlight are:
When I start PySpark I get these warnings:
```
WARN ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
WARN ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
```
When I start the Thrift server I get these:

```
rsync from spark://172.18.0.2:7077
ssh: Could not resolve hostname spark: Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(235) [Receiver=3.1.2]
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to ...
```
I've been through several posts and discussions, and I see people saying that temporary tables can't be exposed via Thrift unless you start the Thrift server from within the same application that registers them. If that's true, how can I do that in Python (PySpark)?
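From what I've pieced together, the idea would be something like the sketch below: calling HiveThriftServer2.startWithContext through the Py4J gateway from inside the same PySpark application. I haven't confirmed this works, and passing spark._jwrapped (which I understand to be the Java SQLContext backing the session in Spark 2.x) is my assumption:

```python
# Untested sketch: start the Thrift server from inside the same PySpark
# application, so it shares the session that holds the temp views.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.hive.thriftServer.singleSession", "true")
         .enableHiveSupport()
         .getOrCreate())

# ... register temp views here, e.g. df.createOrReplaceTempView('peebs') ...

# Reach into the JVM via Py4J. spark._jwrapped should be the Java
# SQLContext wrapper for this session (my assumption for Spark 2.3).
jvm = spark.sparkContext._gateway.jvm
jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(
    spark._jwrapped)
```

If this is viable, the server would run inside the same application as the code that registers the views, which seems to be the point people are making.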
Thanks