There have been a few questions like this, with no answer, like this one here.
I thought I would post another in hopes of getting one.
I have a hive table with duplicate rows. Consider the following example:
*ID Date value1 value2*
1001 20160101 alpha beta
1001 20160201 delta gamma
1001 20160115 rho omega
1002 20160101 able charlie
1002 20160101 able charlie
When complete, I only want two records. Specifically, these two:
*ID Date value1 value2*
1001 20160201 delta gamma
1002 20160101 able charlie
Why those two? For the ID=1001, I want the latest date and the data that is in that row with it. For the ID=1002, really the same answer, but the two records with that ID are complete duplicates, and I only want one.
So, any suggestions on how to do this? The simple "group by" using the ID and the 'max' date won't work, as that ignores the other columns. I cannot put 'max' on those, as it will pull the max columns from all the records (will pull 'rho' from an older record), which is not good.
I hope my explanation is clear, and I appreciate any insight.
Thank you