Kudu with PySpark2: Error with KuduStorageHandler

Question

I am trying to read data stored in Kudu using PySpark 2.1.0:

>>> from os.path import expanduser, join, abspath
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import Row
>>> spark = SparkSession.builder \
        .master("local") \
        .appName("HivePyspark") \
        .config("hive.metastore.warehouse.dir", "hdfs:///user/hive/warehouse") \
        .enableHiveSupport() \
        .getOrCreate()
>>> spark.sql("select count(*) from mySchema.myTable").show()

I have Kudu 1.2.0 installed on the cluster. The tables are Hive/Impala tables.

When I execute the last line, I get the following error:

.
.
.
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
.
.
.
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler
    at org.apache.hadoop.hive.ql.metadata.HiveUtils.getStorageHandler(HiveUtils.java:315)
    at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:284)
    ... 61 more
Caused by: java.lang.ClassNotFoundException: com.cloudera.kudu.hive.KuduStorageHandler

I am referring to the following resources:

How can I include the Kudu-related dependencies in my PySpark program so that I can get past this error?

score 0 · Answer 1 · answered Aug 28 '17 at 16:04

I solved this by passing the kudu-spark JAR to the pyspark2 shell (or to the spark2-submit command).
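As a rough sketch of what that launch command can look like: the helper below only assembles the command line, using the standard `--packages` option with the `org.apache.kudu:kudu-spark2_2.11` Maven coordinate. The function name `build_launch_command` is hypothetical, and the version `1.2.0` is an assumption taken from the Kudu version mentioned in the question; check the Maven page for the artifact that matches your cluster.

```python
import shlex


def build_launch_command(launcher, kudu_version, scala_version="2.11", script=None):
    """Assemble a pyspark2 / spark2-submit command that pulls in kudu-spark2.

    launcher      -- "pyspark2" or "spark2-submit" (the names used in this answer)
    kudu_version  -- kudu-spark2 artifact version (assumption: match your Kudu install)
    scala_version -- Scala suffix of the artifact (assumption: 2.11)
    script        -- optional path to a .py file, only meaningful for spark2-submit
    """
    package = "org.apache.kudu:kudu-spark2_{}:{}".format(scala_version, kudu_version)
    command = [launcher, "--packages", package]
    if script is not None:
        command.append(script)
    return command


# Print the commands so they can be pasted into a terminal.
# "my_job.py" is a placeholder script name.
for cmd in (
    build_launch_command("pyspark2", "1.2.0"),
    build_launch_command("spark2-submit", "1.2.0", script="my_job.py"),
):
    print(" ".join(shlex.quote(part) for part in cmd))
```

If the cluster has no access to Maven, the same effect can usually be had by replacing `--packages <coordinate>` with `--jars <path to the downloaded JAR>`.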

answered Aug 28 '17 at 16:04

New Coder

I'm having the same problem, and I cannot get it to work. Could you share your code? I have passed the kudu-spark2 jar to pyspark2, and the session is created correctly as the `spark` variable. But when I try `spark.sql(...).show()` I get `Error in loading storage handler.com.cloudera.kudu.hive.KuduStorageHandler` – Susensio Mar 01 '18 at 09:28
The code remains the same as above. The only difference is that I provide the Maven package that matches my configuration: https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2_2.11 Helper code: https://github.com/asarraf/KuduPyspark/blob/master/kuduspark2.template.py – New Coder Mar 09 '18 at 02:30

score 0 · Answer 2 · edited Feb 24 '20 at 08:54

With Apache Spark 2.3, you can read a Kudu table from PySpark with the code below:

kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"IP of master").option('kudu.table',"impala::TABLE name").load()

kuduDF.show(5)

Write to a Kudu table with the code below:

DF.write.format('org.apache.kudu.spark.kudu').option('kudu.master',"IP of master").option('kudu.table',"impala::TABLE name").mode("append").save()

Reference link: https://medium.com/@sciencecommitter/how-to-read-from-and-write-to-kudu-tables-in-pyspark-via-impala-c4334b98cf05
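To tie the read path back to the `count(*)` query in the question, here is a minimal sketch that loads the table through the kudu-spark data source, registers it as a temporary view, and then queries that view with `spark.sql`. The helper names, the view name, the master address, and the table name in the usage comment are all hypothetical placeholders; it also assumes the kudu-spark2 package is already on the classpath.

```python
def read_kudu_table(spark, kudu_master, kudu_table):
    """Load a Kudu table as a DataFrame via the kudu-spark data source."""
    return (
        spark.read.format("org.apache.kudu.spark.kudu")
        .option("kudu.master", kudu_master)   # e.g. "host:7051" (placeholder)
        .option("kudu.table", kudu_table)     # Impala-created tables use "impala::db.table"
        .load()
    )


def count_rows(spark, kudu_master, kudu_table, view_name="kudu_view"):
    """Register the Kudu table as a temp view and count its rows with Spark SQL."""
    df = read_kudu_table(spark, kudu_master, kudu_table)
    df.createOrReplaceTempView(view_name)
    return spark.sql("SELECT COUNT(*) AS n FROM {}".format(view_name))


# Usage inside a pyspark2 shell (all values are placeholders):
# count_rows(spark, "kudu-master.example.com:7051", "impala::mySchema.myTable").show()
```

Because the query runs against the temporary view rather than the metastore table, this route goes through the kudu-spark data source instead of the table's Hive storage handler.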

If you want to use Scala instead, see the following reference:

https://kudu.apache.org/docs/developing.html
