
In the spark-shell (Scala), we import org.apache.spark.sql.hive.thriftserver._ and start the Hive Thrift server programmatically for a particular HiveContext via HiveThriftServer2.startWithContext(hiveContext), which exposes a registered temp table for that particular session.

How can we do the same using Python? Is there a package / API in Python for importing HiveThriftServer? Any other thoughts / recommendations are appreciated.

We have used PySpark to create a DataFrame.

Thanks

Ravi Narayanan

  • Why do you need a Thrift server, since these are temporary tables? Couldn't you just create your own HiveContext which will connect to the locally created temporary metastore? – user1314742 Apr 14 '16 at 17:08
  • And BTW, why do you need to start it from your code? – user1314742 Apr 14 '16 at 17:10
  • If we start the Thrift server as a daemon, we are unable to view the temp table (that session is different from the session in which we start the HiveContext, and the temp table is only available in its particular session) – Ravi Narayanan Apr 18 '16 at 14:05
  • Are you starting a metastore service? If not, I'm not surprised: when you run the Spark Thrift server it creates its own metastore backend, and within your code you create another metastore backend, so the two metastores are independent. – user1314742 Apr 18 '16 at 15:47
  • Did you figure out how to do this? – user1158559 May 26 '16 at 17:19
  • @user1158559 did you figure out how to do this? – stackit Jun 16 '16 at 06:50
  • Unfortunately not - I switched to Scala. You might be able to do it through py4j. – user1158559 Jun 16 '16 at 08:21

1 Answer


You can import it using the py4j Java gateway. The following code worked for Spark 2.0.2, and temp tables registered in the Python script could then be queried through beeline.

from py4j.java_gateway import java_import
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .appName(app_name) \
        .master(master)\
        .enableHiveSupport()\
        .config('spark.sql.hive.thriftServer.singleSession', True)\
        .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('INFO')

# Make the HiveThriftServer2 class visible in the py4j JVM view.
java_import(sc._gateway.jvm, "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2")

# Start the Thrift server inside the JVM, passing the Java SQLContext that
# backs this PySpark session so both sides share the same session state.
sc._gateway.jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark._jwrapped)

spark.sql('CREATE TABLE myTable (key INT, value STRING)')  # example schema; the original omitted the columns
data_file = "path to csv file with data"
dataframe = spark.read.option("header", "true").csv(data_file).cache()
dataframe.createOrReplaceTempView("myTempView")
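
Before moving to beeline, you can sanity-check the view from the same Python session (an illustrative query, not part of the original answer):

spark.sql("SELECT COUNT(*) FROM myTempView").show()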

Then go to beeline to check that it started correctly:

in terminal> $SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
beeline> show tables;

It should show the tables and temp tables/views created in Python, including "myTable" and "myTempView" above. It is necessary to be in the same Spark session in order to see temporary views.
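
You can also verify from Python itself over the same HiveServer2 protocol. A minimal sketch, assuming the third-party PyHive package (pip install pyhive) is installed; it is not part of the original answer:

from pyhive import hive

# Connect to the same endpoint beeline uses (jdbc:hive2://localhost:10000).
conn = hive.Connection(host='localhost', port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM myTempView LIMIT 5')
print(cursor.fetchall())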

(see ans: Avoid starting HiveThriftServer2 with created context programmatically.
NOTE: It's possible to access hive tables even if the Thrift server is started from terminal and connected to the same metastore, however temp views cannot be accessed as they are in the spark session and not written to metastore)
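
If you need tables to be visible to a Thrift server started separately from the terminal, one option is to write the DataFrame to the metastore instead of (or in addition to) a temp view. A minimal sketch; the table name is hypothetical:

# Persist to the Hive metastore so any Thrift server sharing that metastore can see it.
dataframe.write.mode('overwrite').saveAsTable('myPersistentTable')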

Sasinda Rukshan