I have the following sample table which captures historical information using change data capture:
id cdc_date cdc_flag active name
1 2020-07-12 Insert true a
2 2020-07-12 Insert true b
3 2020-07-12 Insert true c
4 2020-07-12 Insert true d
1 2020-07-13 Update false a
3 2020-07-13 Update true c_changed
4 2020-07-14 Deleted true d
For a update on any of the columns a new entry is added to the table. Therefore, multiple versions of the same ID exists.
I need to find the total number of ids which were active on or before each cdc_date.
Desired Output:
cdc_date count_active
2020-07-12 4
2020-07-13 4
2020-07-14 3
I am unable to form a query in Athena for the desired output since there is no Procedures or With Recursive query available.
Following is my approach to calculate the active ids for one particular date:
id cdc_date cdc_flag active rank
1 2020-07-12 Insert true 2
2 2020-07-12 Insert true 1
3 2020-07-12 Insert true 2
4 2020-07-12 Insert true 2
1 2020-07-13 Update false 1
3 2020-07-13 Update true 1
4 2020-07-14 Deleted true 1
Select date('2020-07-14') as cdc_date, sum(if(active = 'true',1,0)) as count_active from
(Select *, rank over (partition by id over cdc_date desc) as rank)
where rank = 1 and cdc_flag != 'Deleted' and cdc_date <= date('2020-07-14')
I need to do this for each cdc_date but rank needs to be recalculated for each cdc_date and I am unable to think of a solution without procedure or with recursive.
Please suggest a solution to this using Athena/Presto SQL.