
Can someone point me to a working example of saving a CSV file to an HBase table using Spark 2.2? Options that I tried and failed with (note: all of them work for me with Spark 1.6):

  1. phoenix-spark
  2. hbase-spark
  3. it.nerdammer.bigdata : spark-hbase-connector_2.10

All of them, after I fixed everything else, finally give an error similar to the one in this question: Spark HBase

Thanks

abstractKarshit

2 Answers


Add the following parameters to your Spark job:

    spark-submit \
      --conf "spark.yarn.stagingDir=/somelocation" \
      --conf "spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/somelocation" \
      --conf "spark.hadoop.mapred.output.dir=/somelocation"
Rahul Sharma
  • I have set up HBase and Phoenix locally and did what you said, adding these configs in the code itself. Again, I got the same error. In both cases, with and without the config, the data is loaded successfully and then it gives me the error. – abstractKarshit Sep 28 '17 at 22:31
  • Based on my analysis, the job fails at `HadoopMapReduceCommitProtocol.absPathStagingDir` because the output path is empty, even though it is supplied correctly via the `mapreduce.output.fileoutputformat.outputdir` parameter. The Hadoop configuration is populated into hadoopConf using SparkHadoopUtil and everything looks correct to me. Can you please add these params to the SparkConf object as well: `spark.hadoop.mapreduce.output.dir`, `spark.hadoop.mapred.output.fileoutputformat.outputdir` – Rahul Sharma Sep 30 '17 at 05:30

Phoenix has a Spark plugin and a JDBC thin client, both of which can connect to (read from / write to) HBase. Examples are at https://phoenix.apache.org/phoenix_spark.html

Option 1: connect via the ZooKeeper URL (phoenix-spark plugin)

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.phoenix.spark._

    val sc = new SparkContext("local", "phoenix-test")
    val sqlContext = new SQLContext(sc)

    // Load the Phoenix table TABLE1 as a DataFrame via the phoenix-spark data source.
    // sqlContext.load(...) was removed in Spark 2.x, so use the DataFrameReader API.
    val df = sqlContext.read
      .format("org.apache.phoenix.spark")
      .options(Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181"))
      .load()

    df
      .filter(df("COL1") === "test_row_1" && df("ID") === 1L)
      .select(df("ID"))
      .show()
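Since the question is about saving a CSV, the same plugin can also write. A minimal sketch, assuming a CSV at /tmp/input.csv (hypothetical path) whose columns match a pre-created Phoenix table OUTPUT_TABLE (also hypothetical); note that phoenix-spark requires SaveMode.Overwrite, which performs upserts rather than truncating the table:

    import org.apache.spark.sql.SaveMode

    // Read the CSV with Spark 2.x's built-in csv source; the path is a placeholder
    val csvDf = sqlContext.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/input.csv")

    // Upsert the rows into the (hypothetical) Phoenix table OUTPUT_TABLE
    csvDf.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .options(Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "phoenix-server:2181"))
      .save()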

Option 2: use the JDBC thin client provided by the Phoenix Query Server

More info at https://phoenix.apache.org/server.html

jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF
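A minimal sketch of using that URL from plain JDBC, assuming the Phoenix thin-client jar (which provides org.apache.phoenix.queryserver.client.Driver) is on the classpath and reusing TABLE1 from the example above:

    import java.sql.DriverManager

    // Load the Phoenix thin-client driver shipped with the query server client jar
    Class.forName("org.apache.phoenix.queryserver.client.Driver")

    val conn = DriverManager.getConnection(
      "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF")
    try {
      val stmt = conn.createStatement()
      // UPSERT is Phoenix's combined insert/update statement
      stmt.executeUpdate("UPSERT INTO TABLE1 (ID, COL1) VALUES (1, 'test_row_1')")
      conn.commit() // Phoenix connections do not autocommit by default
    } finally {
      conn.close()
    }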
Augustine