
I need to read Phoenix data into PySpark.

Edit: I'm using the Spark HBase converters.

Here is a code snippet:

port="2181"
host="zookeperserver"
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
cmdata_conf = {"hbase.zookeeper.property.clientPort":port, "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "camel", "hbase.mapreduce.scan.columns": "data:a"}
sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat","org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result",keyConverter=keyConv,valueConverter=valueConv,conf=cmdata_conf)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
    jconf, batchSize)
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:130)

Any help would be much appreciated.

Thank you! /Tina


2 Answers


Using the Spark Phoenix plugin is the recommended approach; please see the Apache Phoenix documentation for details about the phoenix-spark plugin.

Environment: tested with AWS EMR 5.10, PySpark.

Following are the steps:

  1. Create a table in Phoenix (SQL reference: https://phoenix.apache.org/language/). Open the Phoenix shell:

    /usr/lib/phoenix/bin/sqlline.py

    DROP TABLE IF EXISTS TableName;

    CREATE TABLE TableName (DOMAIN VARCHAR PRIMARY KEY);

    UPSERT INTO TableName (DOMAIN) VALUES('foo');

  2. Download the Spark Phoenix plugin jar from https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core/4.11.0-HBase-1.3. You need phoenix-<version>-HBase-<version>-client.jar; I used phoenix-4.11.0-HBase-1.3-client.jar to match my Phoenix and HBase versions.

  3. From your Hadoop home directory, set up the following variable:

    phoenix_jars=/home/user/apache-phoenix-4.11.0-HBase-1.3-bin/phoenix-4.11.0-HBase-1.3-client.jar

  4. Start the PySpark shell and add the dependency to the driver and executor classpath:

    pyspark --jars ${phoenix_jars} --conf spark.executor.extraClassPath=${phoenix_jars}

Create the ZooKeeper URL: replace the placeholder below with your cluster's ZooKeeper quorum (you can find it in hbase-site.xml).

emrMaster = "ZooKeeper URL" 

df = sqlContext.read \
  .format("org.apache.phoenix.spark") \
  .option("table", "TableName") \
  .option("zkUrl", emrMaster) \
  .load()

df.show()
df.columns
df.printSchema()

df1 = df.replace(['foo'], ['foo1'], 'DOMAIN')
df1.show()

df1.write \
  .format("org.apache.phoenix.spark") \
  .mode("overwrite") \
  .option("table", "TableName") \
  .option("zkUrl", emrMaster) \
  .save()
Raj Kumar Rai

There are two ways to do this:
1) As Phoenix has a JDBC layer, you can use Spark's JdbcRDD to read data from Phoenix in Spark: https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/rdd/JdbcRDD.html (see the PySpark sketch after this list)

2) Using Spark HBase converters: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala and https://github.com/apache/spark/tree/master/examples/src/main/python
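
As a rough illustration of option 1, here is a minimal PySpark sketch that uses Spark's generic JDBC DataFrame reader (available in Spark 1.4+) rather than the Scala JdbcRDD. It assumes the Phoenix client jar is on the driver and executor classpath, reuses the zookeperserver:2181 quorum from the question, and uses a placeholder table name:

# Hypothetical sketch: read a Phoenix table over JDBC from PySpark.
# Assumes the Phoenix client jar was added via --jars / spark.executor.extraClassPath,
# and that zookeperserver:2181 is your ZooKeeper quorum (taken from the question).
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver") \
    .option("url", "jdbc:phoenix:zookeperserver:2181") \
    .option("dbtable", "MY_PHOENIX_TABLE") \
    .load()
df.show()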

Sachin Janani
  • I tried the second way, however I'm getting an error: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD. : java.io.IOException: No table was provided. Have you done this in PySpark? – Ranic Sep 15 '15 at 14:11
  • Have you provided proper configurations to Spark newAPIHadoopRDD as follows: sparkconf = { "hbase.zookeeper.quorum": zookeeperhost, "hbase.mapreduce.inputtable": sampletable, "hbase.mapreduce.scan.columns": column} hbase_rdd = sc.newAPIHadoopRDD( "org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", keyConverter=keyConv, valueConverter=valueConv, conf=sparkconf) – Sachin Janani Sep 15 '15 at 14:26
  • Please try the above method; I think you have not provided the table name in the configuration. Also, the values of keyConv and valueConv should be examples.pythonconverters.ImmutableBytesWritableToStringConverter and examples.pythonconverters.HBaseResultToStringConverter respectively – Sachin Janani Sep 15 '15 at 14:31
  • Please follow the below link for the example: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py – Sachin Janani Sep 15 '15 at 14:34
  • Thank you Sachin! However, I did all these steps. 1. I start pyspark with the needed jars, these are /hbase-0.90.1.jar, $SPARK_HOME/lib/spark-examples-1.3.1.2.3.0.0-2557hadoop2.7.1.2.3.0.0-2557.jar 2. Then (in IPython Notebook): from pyspark import SparkContext port="2181" host="hostname" keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter" valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter" – Ranic Sep 15 '15 at 14:39
  • cmdata_conf = {"hbase.zookeeper.property.clientPort":port, "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable”: "test", "hbase.mapreduce.scan.columns”: "f1:a"} – Ranic Sep 15 '15 at 14:39
  • And when I run: cmdata_rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat","org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result",keyConverter=keyConv,valueConverter=valueConv,conf=cmdata_conf) I receive the above error. – Ranic Sep 15 '15 at 14:42
  • Is the table in the default namespace? Can you join chat? – Sachin Janani Sep 15 '15 at 14:49