I have a 2 tables in Hive which are managed using SCD Type 2 (https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row). Basically, each record has 2 columns valid_start_date and valid_end_date to specify the period during which it is valid.
The 2 tables:
Employee (Notice that employee_id 1 changed their address on 2010-10-20)
employee_id employee_name employee_address valid_start_date valid_end_date
1 Bob 123 XXX 2010-01-01 2010-10-20
2 Alice 999 YYY 2010-01-01 9999-12-31
1 Bob 567 ZZZ 2010-10-20 9999-12-31
Workplace (Notice that employee_id 1 changed building on 2010-12-31)
employee_id building_id valid_start_date valid_end_date
1 1 2010-01-01 2010-12-31
1 2 2010-12-31 9999-12-31
2 2 2010-01-01 9999-12-31
I want to join them and get this result
employee_id employee_name employee_address building_id valid_start_date valid_end_date
1 Bob 123 XXX 1 2010-01-01 2010-10-20
1 Bob 567 ZZZ 1 2010-10-20 2010-12-31
1 Bob 567 ZZZ 2 2010-12-31 9999-12-31
2 Alice 999 YYY 1 2010-01-01 9999-12-31
This is the query that I came up with
WITH
valid_dates AS (
SELECT
employee_id
, valid_start_date valid_date
FROM
employee
UNION
SELECT
employee_id
, valid_end_date valid_date
FROM
employee
UNION
SELECT
employee_id
, valid_start_date valid_date
FROM
workplace
UNION
SELECT
employee_id
, valid_end_date valid_date
FROM
workplace
),
valid_date_ranges AS (
SELECT
employee_id
, valid_start_date
, valid_end_date
FROM (
SELECT
employee_id
, valid_date valid_start_date
, LEAD(valid_date, 1) OVER (
PARTITION BY employee_id
ORDER BY valid_date) valid_end_date
FROM
valid_dates
) valid_date_ranges_with_null
WHERE
valid_end_date IS NOT NULL
)
SELECT
vdr.employee_id
, e.employee_name
, e.employee_address
, wp.building_id
, vdr.valid_start_date
, vdr.valid_end_date
FROM
employee e
INNER JOIN
valid_date_ranges vdr
ON
e.employee_id = vdr.employee_id
LEFT OUTER JOIN -- there may be employees without workplace
workplace wp
ON
wp.employee_id = vdr.employee_id
WHERE
e.valid_start_date < vdr.valid_end_date
AND e.valid_end_date > vdr.valid_start_date
AND wp.valid_start_date < vdr.valid_end_date
AND wp.valid_end_date > vdr.valid_start_date
;
I seems to produce the correct result but I want my query to run faster (this is currently the bottle neck of my pipeline). Also, I have to do the same thing with up to 5 tables joining together and up to 3 billion rows each so I really hope there is a way to optimize this query. Can you help me please? Thank you!
I'm on Hive 2.1.0 and there is no non-equi join yet by the way.