1

I have a master table (~100mm records) which needs to be updated/inserted with daily delta that gets processed every day.

Typical daily volume for delta would be few hundred thousand records. This can be implemented using full join or windowing function row_number+union all.

But my question is which out of these two is a better approach to go for using Hive (it’s running on Tez and the version is 2.1). We want to update all fields in master for a record which has a change in delta so would like to go with row_number+union all and looking for some optimization strategies.

octobus
  • 1,246
  • 1
  • 14
  • 20

2 Answers2

0

I think general rule of thumb is- avoid full joins - always.

I don't know how do you mean to use windowing function with union, but in my experience the following worked quite well (assuming join matches 1:1):

Full join

select 
    coalesce(x.a, y.a) as a,
    coalesce(x.b, y.b) as b,
    coalesce(x.c, y.c) as c,
    coalesce(x.d, y.d) as d,
    x.xe,
    y.ye
from
    x
full outer join y 
    on x.a=y.a and x.b=y.b

Union

select 
    a,
    b,
    coalesce(max(x_c), max(y_c)) as c,
    coalesce(max(x_d), max(y_d)) as d,
    max(xe) as xe,
    max(ye) as ye
from
    (select a,b,c as x_c,d as x_d, null as y_c, null as y_d, xe, null as ye from x
    union all 
     select a,b,null as x_c, null as x_d, c as y_c, d as y_d, null as xe, ye from y) main
group by 
    a,b
Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34
-1

Use explain your_sql; to observe the execution plan for the queries you're comparing.

For the case of implementing SCD1, full join and row_number+union yields very similar plan, both containing just one stage of map-reduce. So there's no obvious performance difference.

But do note there's more flexibility you can get out of full join.

damientseng
  • 533
  • 2
  • 19