I have a master table (~100 million records) that needs to be updated/inserted (upserted) with a delta that is processed every day.
The typical daily delta volume is a few hundred thousand records. This can be implemented using either a full join
or the windowing function row_number combined with union all.
My question is: which of these two is the better approach in Hive (running on Tez, version 2.1)? We want to update all fields in the master for any record that has a change in the delta, so we would like to go with row_number + union all, and we are looking for optimization strategies.
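
For reference, here is a rough sketch of the row_number + union all approach I have in mind. All table and column names (master, delta, master_new, id, col1, col2, updated_at) are hypothetical placeholders, and it assumes id is the record key and that a delta row should always win over the existing master row:

```sql
-- Sketch only: hypothetical table/column names, not the real schema.
-- Assumes `id` is the record key and delta rows replace master rows.
INSERT OVERWRITE TABLE master_new
SELECT id, col1, col2, updated_at
FROM (
  SELECT u.id, u.col1, u.col2, u.updated_at,
         -- Rank rows per key: delta (priority 1) sorts ahead of master (priority 0);
         -- updated_at breaks ties if the delta holds several rows for one key.
         ROW_NUMBER() OVER (
           PARTITION BY u.id
           ORDER BY u.src_priority DESC, u.updated_at DESC
         ) AS rn
  FROM (
    SELECT id, col1, col2, updated_at, 0 AS src_priority FROM master
    UNION ALL
    SELECT id, col1, col2, updated_at, 1 AS src_priority FROM delta
  ) u
) ranked
WHERE rn = 1;  -- keep exactly one row per key: the newest version
```

Since rn = 1 keeps the whole winning row, every field of a changed record is replaced at once, and keys that exist only in the delta are inserted. The concern is that this rewrites all ~100 million master rows each day even though only a few hundred thousand change.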