I have to ingest large amounts of data into HBase every day — on average about 102×10^6 (roughly 102 million) records per load.
However, I can't simply load this data into HBase, because I have to compare each record against the previous month's data and check for duplicates. When there is a duplicate I must keep only one of the two values.
Here is an example:
tableTest(pk, value)
record1: (id:1, val:5), record2: (id:1, val:8)
In this case I'll keep in HBase (id:1, val:max(5,8)), i.e. (id:1, val:8).
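To make the conflict rule concrete, here is a minimal sketch in plain Python (just the rule itself, not the actual Spark/Phoenix code; the function name is mine):

```python
def resolve_duplicate(existing_val, incoming_val):
    """Conflict rule: on a duplicate primary key, keep the larger value."""
    return max(existing_val, incoming_val)

# record1 = (id:1, val:5), record2 = (id:1, val:8)
print(resolve_duplicate(5, 8))  # -> 8
```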
Now, since I'm processing this data in Spark and then saving the RDD directly to HBase through the Phoenix API saveToPhoenix (which performs a lot of upserts under the hood), one solution would be to load the one-month-old data into Spark as well, perform all the updates at the RDD level, and then save the result.
However, this solution would be quite inefficient, since I'd have to load roughly (102×10^6)×30 records, and it has other drawbacks specific to the problem I'm trying to solve.
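For clarity, the RDD-level approach amounts to something like the following, sketched here with plain Python lists and a dict in place of RDDs (in Spark it would essentially be a union of the two datasets followed by reduceByKey(max); the function name is mine):

```python
def merge_with_history(old_records, new_records):
    """Merge two (pk, val) datasets, keeping the max val per pk.

    Dict-based stand-in for: old_rdd.union(new_rdd).reduceByKey(max).
    """
    merged = {}
    for pk, val in list(old_records) + list(new_records):
        merged[pk] = val if pk not in merged else max(merged[pk], val)
    return merged

old = [(1, 5), (2, 3)]  # the month of already-loaded data
new = [(1, 8), (3, 7)]  # today's batch
print(merge_with_history(old, new))  # {1: 8, 2: 3, 3: 7}
```

The cost problem is visible even in the sketch: `old_records` stands for a month of history (~3×10^9 rows in my case), all of which would have to be pulled into Spark just to resolve a comparatively small number of conflicts.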
I was wondering whether there is a trigger-like mechanism in Phoenix that would let me handle this logic (keeping the max of the existing and incoming values during an upsert) on the DB side.
Is the HBase coprocessor feature the closest (and only) solution?