I have a use case for an automated SparkSQL job where I want to do the following:
Read a table (let's call it table1) from Phoenix using Spark and gather into a DataFrame (let's call it df1) all the negative values found.
Then delete records from another table (table2) where values in a column appear in df1 (I thought about doing a JOIN query, but I wanted to know if this is possible with a DataFrame, and whether there is an API using HBase and Spark DataFrames).
AFAIK Phoenix doesn't directly support DELETE operations via Spark (please correct me if I'm wrong; if there is a way, I'd gladly hear about it), which is why I'm more inclined to use the HBase Spark API.
Here is some code.
Gather the negative values in a DataFrame:
// Collect negative values
import org.apache.phoenix.spark._   // provides phoenixTableAsDataFrame on SQLContext
import spark.implicits._            // enables the 'COLUMN symbol syntax below

val negativeValues = spark
  .sqlContext
  .phoenixTableAsDataFrame("phoenix.table1", Seq(), conf = hbaseConf)
  .select('COLUMN1)
  .where('COLUMN2.lt(0))
// Send the query
[...]
Delete the values from table2 where COLUMN1 is in negativeValues; in SQL it would look something like this (and ideally it would be possible to apply the IN to the DataFrame directly):
DELETE FROM table2 WHERE COLUMN1 IN negativeValues
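For reference, here is a sketch of what issuing those deletes through the plain HBase client from Spark might look like. Everything here is an assumption on my part: that table2's underlying HBase table is named "TABLE2", that its row key is exactly the raw COLUMN1 bytes, and that COLUMN1 is a string. Phoenix manages its own row-key encoding (type encoding, optional salting), so the real key bytes may well differ:

```scala
import java.util.{ArrayList => JArrayList}

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical sketch: delete from HBase table "TABLE2" every row whose
// key matches a COLUMN1 value from negativeValues.
// Assumes the row key is the raw COLUMN1 bytes -- for Phoenix-managed
// tables the key is type-encoded (and possibly salted), so verify first.
negativeValues
  .select("COLUMN1")
  .foreachPartition { rows: Iterator[org.apache.spark.sql.Row] =>
    val conn  = ConnectionFactory.createConnection() // reads hbase-site.xml
    val table = conn.getTable(TableName.valueOf("TABLE2"))
    val deletes = new JArrayList[Delete]()
    rows.foreach(r => deletes.add(new Delete(Bytes.toBytes(r.getString(0)))))
    table.delete(deletes) // batched delete RPC
    table.close()
    conn.close()
  }
```

Opening the connection inside foreachPartition keeps the non-serializable HBase client objects on the executors instead of shipping them from the driver.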
My expected result would be this:

table1

column1 | column2
--------+--------
 123456 |  123
 234567 |  456
 345678 | -789
 456789 |  012
 567891 | -123

table2

column1 | column2
--------+--------
 123456 | 321
 234567 | 654
 345678 | 945   <---- same column1 as table1's, so delete
 456789 | 987
 567891 | 675   <---- same column1 as table1's, so delete
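As for applying the IN to the DataFrame directly: in pure Spark the matching itself can be expressed either as a left_anti join or with isin on a collected list. Note this only computes which rows of table2 would survive; it does not delete anything in Phoenix/HBase. A sketch, assuming df2 is table2 read via phoenixTableAsDataFrame and that the set of negative keys is small enough to collect:

```scala
// Sketch: compute the rows of table2 that would survive the delete.
// left_anti keeps the rows of df2 that have no match in negativeValues.
val survivors = df2.join(negativeValues, Seq("COLUMN1"), "left_anti")

// Equivalent with IN, by collecting the (small) list of keys first:
val keys = negativeValues.collect().map(_.get(0))
val survivors2 = df2.where(!$"COLUMN1".isin(keys: _*))
```

The left_anti variant stays fully distributed, whereas the isin variant pulls the key list to the driver, so the join scales better when df1 is large.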
So ultimately, I'd like to know if there's a way to send that DELETE request to HBase via Spark without too much fuss.
Thank you.