
I am using the following code in Spark to load specified columns of my HBase/Phoenix table into a Spark DataFrame. I can specify the columns I want to load, but can I specify which rows? Or do I have to load all rows?

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

sc.stop()

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
// HBase/Phoenix connection settings are picked up from hbase-site.xml on the classpath
val configuration = HBaseConfiguration.create()

val df = sqlContext.phoenixTableAsDataFrame(
     "TABLENAME", Array("ROWKEY", "CF.COL1","CF.COL2","CF.COL3"), conf = configuration
     )
  • If you are trying to filter by row, I think that kind of computing should be done in Spark instead of Phoenix – Anish Nair Feb 06 '20 at 01:00
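
For comparison, filtering on the Spark side (as the comment suggests) means loading the rows first and then applying an ordinary DataFrame filter; a minimal sketch, assuming the df from the snippet above:

// Spark-side filtering: all rows are read first, then dropped in Spark
val filtered = df.filter("ROWKEY IN ('1', '2')")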

1 Answer


You can add a predicate to your call to restrict which rows get retrieved, e.g.,

val df = sqlContext.phoenixTableAsDataFrame(
     "TABLENAME", Array("ROWKEY", "CF.COL1","CF.COL2","CF.COL3"),
     conf = configuration,
     predicate = Some("ROWKEY IN ('1', '2')")
     )
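
The predicate is passed through to the underlying Phoenix query (effectively its WHERE clause), so the filtering should happen in Phoenix/HBase before any rows reach Spark. Any expression that is valid in a Phoenix WHERE clause should work; for example, a sketch of a range scan on the row key (the key values are placeholders):

val dfRange = sqlContext.phoenixTableAsDataFrame(
     "TABLENAME", Array("ROWKEY", "CF.COL1"),
     conf = configuration,
     predicate = Some("ROWKEY >= '100' AND ROWKEY < '200'")  // placeholder key range
     )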
kc2001