
I am trying to run SQL queries using the spark.sql() or sqlContext.sql() method (here spark is the SparkSession object that is available when the EMR Notebook starts) on a public dataset, from an EMR notebook attached to an EMR cluster that has Hadoop, Spark and Livy installed. On running any basic SQL query I get the error:

AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;

I specifically want to use SQL queries, so using the DataFrame API instead is not an acceptable alternative for me.

This Spark EMR cluster does not have a separate Hive component installed, and I do not intend to use one. I have looked into possible causes of this issue; one of them could be that the EMR notebook does not have write permission to create the metastore_db, but I could not confirm this. I have also tried to find this error in the log files on the cluster, but could not locate it and am not sure which file would contain it so that I can get more details.
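
One check I can run from the notebook itself is to look at which catalog implementation the Spark session is using and where it expects the warehouse directory to be. This is a minimal sketch using standard Spark configuration keys (nothing EMR-specific is assumed); "hive" means Spark will try to instantiate a Hive metastore client, "in-memory" means it uses its built-in catalog:

# Inspect catalog-related settings of the running SparkSession.
for key in ("spark.sql.catalogImplementation", "spark.sql.warehouse.dir"):
    # The key may not be set explicitly on this cluster, so provide a default.
    print(key, "=", spark.conf.get(key, "<not set>"))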

Steps to reproduce the problem:

  1. Create an AWS EMR cluster from the console using the Quick Options view and select the Spark option. This installs Spark 2.4.3 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.1. The cluster can have 1 master and 2 core nodes, or just a single master node.

  2. Create an EMR Notebook from the Notebooks link on the EMR page, attach it to the cluster you just created and open it (by default the kernel is PySpark, as seen at the top right of the notebook).

  3. The code I am using runs a spark.sql query on the Amazon reviews dataset, which is public.
  4. Code:
# Importing data from s3
input_bucket = 's3://amazon-reviews-pds'
input_path = '/parquet/product_category=Books/*.parquet'
df = spark.read.parquet(input_bucket + input_path)
# Register temporary view
df.createOrReplaceTempView("reviews")
sqlDF = sqlContext.sql("""SELECT product_id FROM reviews LIMIT 5""")

I expect 5 product_id values from this dataset to be returned; instead I get the error:

u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 358, in sql
    return self.sparkSession.sql(sqlQuery)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'

3 Answers


I had the same problem and I realized that I didn't have Hive on my EMR cluster.

After launching another cluster and making sure that Hive was selected, it worked.
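
For reference, here is a minimal sketch of launching such a cluster programmatically with boto3; the release label, instance types and counts, region, and role names are assumptions, and the essential part is listing Hive alongside Spark and Livy in Applications:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Launch an EMR cluster with Hive installed next to Spark and Livy,
# so that Spark SQL can instantiate a Hive metastore client.
response = emr.run_job_flow(
    Name="spark-sql-notebook-cluster",   # hypothetical cluster name
    ReleaseLabel="emr-5.26.0",           # assumed release that ships Spark 2.4.3
    Applications=[
        {"Name": "Hadoop"},
        {"Name": "Spark"},
        {"Name": "Livy"},
        {"Name": "Hive"},                # the application that was missing
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])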

Gustavo Muenz

The notebook should run on an EMR cluster that has a compatible Hive version installed.
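
A quick way to confirm which applications (and versions) the attached cluster actually has is to query it with boto3; the cluster ID and region below are hypothetical:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# List the applications installed on the cluster; Hive should appear here
# if Spark SQL is expected to talk to a Hive metastore.
cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")  # hypothetical cluster ID
for app in cluster["Cluster"]["Applications"]:
    print(app["Name"], app.get("Version"))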

ankursingh1000

This is what worked for me: while launching the EMR cluster, under the Software configuration section, I made sure to check the Use AWS Glue Data Catalog for table metadata checkbox.
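
The console checkbox corresponds to an EMR configuration classification. Here is a minimal sketch of passing the same setting when creating the cluster with boto3; the spark-hive-site classification and the factory class come from the EMR / Glue Data Catalog integration documentation, and the surrounding run_job_flow call is assumed:

# Configuration classification that points Spark SQL at the AWS Glue Data Catalog
# as its Hive metastore, instead of a local Derby metastore_db.
glue_catalog_config = [
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }
]

# Passed as the Configurations argument when creating the cluster, e.g.
# emr.run_job_flow(..., Configurations=glue_catalog_config)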

Jane Kathambi