0

I have the following sample table which captures historical information using change data capture:

id cdc_date    cdc_flag active name
1  2020-07-12  Insert   true   a  
2  2020-07-12  Insert   true   b
3  2020-07-12  Insert   true   c
4  2020-07-12  Insert   true   d
1  2020-07-13  Update   false  a
3  2020-07-13  Update   true   c_changed
4  2020-07-14  Deleted  true   d

For a update on any of the columns a new entry is added to the table. Therefore, multiple versions of the same ID exists.

I need to find the total number of ids which were active on or before each cdc_date.

Desired Output:

cdc_date   count_active
2020-07-12 4
2020-07-13 4
2020-07-14 3

I am unable to form a query in Athena for the desired output since there is no Procedures or With Recursive query available.

Following is my approach to calculate the active ids for one particular date:

id cdc_date    cdc_flag active rank
1  2020-07-12  Insert   true   2
2  2020-07-12  Insert   true   1
3  2020-07-12  Insert   true   2
4  2020-07-12  Insert   true   2 
1  2020-07-13  Update   false  1
3  2020-07-13  Update   true   1 
4  2020-07-14  Deleted  true   1
Select date('2020-07-14') as cdc_date, sum(if(active = 'true',1,0)) as count_active from 
(Select *, rank over (partition by id over cdc_date desc) as rank)
where rank = 1 and cdc_flag != 'Deleted' and cdc_date <= date('2020-07-14')

I need to do this for each cdc_date but rank needs to be recalculated for each cdc_date and I am unable to think of a solution without procedure or with recursive.

Please suggest a solution to this using Athena/Presto SQL.

zealous
  • 7,336
  • 4
  • 16
  • 36

1 Answers1

0

You can use a cumulative sum with aggregation:

select cdc_date,
       sum(sum(case when cdc_flag = 'active' then 1
                    when cdc_flag = 'Deleted' then -1
                    else 0
               end)
          ) over (order by cdc_date) as num_actives
from t
group by cdc_date;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks for the reply. In this case, cdc_flag has values Insert, Update and Deleted. Therefore, for one ID there can be many entries over different dates. I need to use only the most recent one until the date in consideration. In this solution, it will sum more than once for the same ID. Just to add, this is only a sample data, the actual data set has a lot of other columns and for each update of any other column, there is a new entry in the table. – Pratik Kewalramani Jul 17 '20 at 12:54