I am trying to read data in stored as Kudu using PySpark 2.1.0
>>> from os.path import expanduser, join, abspath
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession.builder \
.master("local") \
.appName("HivePyspark") \
.config("hive.metastore.warehouse.dir", "hdfs:///user/hive/warehouse") \
.enableHiveSupport() \
.getOrCreate()
>>> spark.sql("select count(*) from mySchema.myTable").show()
I have Kudu 1.2.0 installed on the cluster. Those are hive/ Impala tables.
When I execute the last line, I get the following error:
.
.
.
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
.
.
.
aused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getStorageHandler(HiveUtils.java:315)
at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:284)
... 61 more
Caused by: java.lang.ClassNotFoundException: com.cloudera.kudu.hive.KuduStorageHandler
I am referring to the following resources:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
https://github.com/bkvarda/iot_demo/blob/master/total_data_count.py
https://kudu.apache.org/docs/developing.html#_kudu_python_client
I am interested to know how I can include the Kudu related dependencies into my pyspark program so that I can move past this error.