I have a use case for an automated SparkSQL job where I want to do the following:
Read a table (let's call it table1) from Phoenix using Spark and gather into a DataFrame (let's call it df1) all the negative values found.
Then delete records from another table (table2) where values in a column appear in df1 (I thought about doing a JOIN query, but I wanted to know if this is possible with a DataFrame, and whether there is an API using HBase and Spark DataFrames).
AFAIK Phoenix doesn't directly support DELETE operations via Spark (please correct me if I'm wrong; if there is a way, I'd gladly hear about it), which is why I'm more inclined to use the HBase Spark API.
Here is some code.
Gather the negative values in a DataFrame:
// Collect negative values
import org.apache.phoenix.spark._   // provides phoenixTableAsDataFrame on SQLContext
import spark.implicits._            // enables the 'COLUMN symbol syntax below

val negativeValues = spark
  .sqlContext
  .phoenixTableAsDataFrame("phoenix.table1", Seq(), conf = hbaseConf)
  .select('COLUMN1)
  .where('COLUMN2.lt(0))
// Send the query
[...]
Delete the values from table2 where COLUMN1 is in negativeValues; in SQL it would look something like this (and ideally it would be possible to apply the IN to the DataFrame directly):
DELETE FROM table2 WHERE COLUMN1 IN negativeValues
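For reference, here is a sketch of what issuing those deletes through the plain HBase client from Spark might look like. Everything here is an assumption on my part: that table2's underlying HBase table is named "TABLE2", that its row key is exactly the raw COLUMN1 bytes, and that COLUMN1 is a string. Phoenix manages its own row-key encoding (type encoding, optional salting), so the real key bytes may well differ:

```scala
import java.util.{ArrayList => JArrayList}

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical sketch: delete from HBase table "TABLE2" every row whose
// key matches a COLUMN1 value from negativeValues.
// Assumes the row key is the raw COLUMN1 bytes -- for Phoenix-managed
// tables the key is type-encoded (and possibly salted), so verify first.
negativeValues
  .select("COLUMN1")
  .foreachPartition { rows: Iterator[org.apache.spark.sql.Row] =>
    val conn  = ConnectionFactory.createConnection() // reads hbase-site.xml
    val table = conn.getTable(TableName.valueOf("TABLE2"))
    val deletes = new JArrayList[Delete]()
    rows.foreach(r => deletes.add(new Delete(Bytes.toBytes(r.getString(0)))))
    table.delete(deletes) // batched delete RPC
    table.close()
    conn.close()
  }
```

Opening the connection inside foreachPartition keeps the non-serializable HBase client objects on the executors instead of shipping them from the driver.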
My expected result would be this:

table1

column1 | column2
--------+--------
 123456 |  123
 234567 |  456
 345678 | -789
 456789 |  012
 567891 | -123

table2

column1 | column2
--------+--------
 123456 | 321
 234567 | 654
 345678 | 945   <---- same column1 as table1's, so delete
 456789 | 987
 567891 | 675   <---- same column1 as table1's, so delete
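As for applying the IN to the DataFrame directly: in pure Spark the matching itself can be expressed either as a left_anti join or with isin on a collected list. Note this only computes which rows of table2 would survive; it does not delete anything in Phoenix/HBase. A sketch, assuming df2 is table2 read via phoenixTableAsDataFrame and that the set of negative keys is small enough to collect:

```scala
// Sketch: compute the rows of table2 that would survive the delete.
// left_anti keeps the rows of df2 that have no match in negativeValues.
val survivors = df2.join(negativeValues, Seq("COLUMN1"), "left_anti")

// Equivalent with IN, by collecting the (small) list of keys first:
val keys = negativeValues.collect().map(_.get(0))
val survivors2 = df2.where(!$"COLUMN1".isin(keys: _*))
```

The left_anti variant stays fully distributed, whereas the isin variant pulls the key list to the driver, so the join scales better when df1 is large.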
So ultimately, I'd like to know if there's a way to send that DELETE request to HBase via Spark without too much fuss.
Thank you.