
I have a problem which can be handled by a recursive CTE, but not within an acceptable period of time. Can anyone point me at ways to improve the performance and/or get the same result a different way?

Here's my scenario!

I have: A large table which contains, in each row, an id, a start date, an end date, and a ranking number. There are multiple rows for each id and the date ranges often overlap. Dates are from 2010 onward.

I want: A table which contains a row for each combination of id + date which falls inside any date range for that id from the previous table. Each row should have the lowest ranking number for that id and day.

Eg:

ID  Rank  Range
1   1     1/1/2010-1/4/2010
1   2     1/2/2010-1/5/2010
2   1     1/1/2010-1/2/2010

becomes

ID  Rank  Day
1   1     1/1/2010
1   1     1/2/2010
1   1     1/3/2010
1   1     1/4/2010
1   2     1/5/2010
2   1     1/1/2010
2   1     1/2/2010
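
For reference, here's a minimal sketch of the table and that sample data (the column names match the query below; the exact DDL is simplified for illustration):

-- simplified DDL, just enough to reproduce the example above
create table big_base_table (
  personid      number,
  encounterdate date,   -- start of the range
  enddate       date,   -- end of the range (inclusive)
  type_rank     number  -- ranking number; lower wins
);

insert into big_base_table values (1, date '2010-01-01', date '2010-01-04', 1);
insert into big_base_table values (1, date '2010-01-02', date '2010-01-05', 2);
insert into big_base_table values (2, date '2010-01-01', date '2010-01-02', 1);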

I can do this with a recursive CTE, but the performance is terrible (20-25 minutes for a relatively small data set which produces a final table with 31 million rows):

with enc(PersonID, EncounterDate, EndDate, Type_Rank) as (
-- anchor member: one row per source range, starting on its first day
select PersonID, EncounterDate, EndDate, Type_Rank
from Big_Base_Table
union all
-- recursive member: step forward one day at a time until the end of the range
select PersonID, EncounterDate + 1, EndDate, Type_Rank
from enc
where EncounterDate + 1 <= EndDate
)
select PersonID, EncounterDate, min(Type_Rank) Type_Rank
from enc
group by PersonID, EncounterDate
;

1 Answer


You could extract all possible dates from the table once in a CTE, and then join that back to the table:

with all_dates (day) as (
  -- generate one row per day, from the overall minimum start date to the maximum end date
  select start_date + level - 1
  from (
    select min(start_date) as start_date, max(end_date) as end_date
    from big_base_table
  )
  connect by level <= end_date - start_date + 1
)
select bbt.id, min(bbt.type_rank) as type_rank, to_char(ad.day, 'YYYY-MM-DD') as day
from all_dates ad
join big_base_table bbt
on bbt.start_date <= ad.day
and bbt.end_date >= ad.day
group by bbt.id, ad.day
order by bbt.id, ad.day;

        ID  TYPE_RANK DAY       
---------- ---------- ----------
         1          1 2010-01-01
         1          1 2010-01-02
         1          1 2010-01-03
         1          1 2010-01-04
         1          2 2010-01-05
         2          1 2010-01-01
         2          1 2010-01-02


7 rows selected. 

The CTE gets all dates from the lowest start date for any ID up to the highest end date for any ID. You could also use a static calendar table for that if you have one, to save hitting the base table twice (and because getting the min and max in the same query is slow in some versions, at least).
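
With a hypothetical calendar table (here called calendar_table, with one row per day in a day column; both names are assumptions), the generating CTE could be replaced with a simple filter on that table, e.g.:

with all_dates (day) as (
  select c.day
  from calendar_table c                -- hypothetical pre-built calendar table
  where c.day >= date '2010-01-01'     -- your data starts in 2010
  and c.day <= trunc(sysdate)          -- or whatever upper bound fits your data
)
select bbt.id, min(bbt.type_rank) as type_rank, to_char(ad.day, 'YYYY-MM-DD') as day
from all_dates ad
join big_base_table bbt
on bbt.start_date <= ad.day
and bbt.end_date >= ad.day
group by bbt.id, ad.day
order by bbt.id, ad.day;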

You could also write it the other way round, as:

...
from big_base_table bbt
join all_dates ad
on ad.day >= bbt.start_date
and ad.day <= bbt.end_date
...

but I think the optimiser will probably end up treating them the same, with a single full scan of your base table; it's worth checking the plan it actually comes up with for both, though, and whether one is more efficient than the other.
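
For example, you can get the estimated plan for either version with the standard explain plan / dbms_xplan combination (shown here for the first version):

explain plan for
with all_dates (day) as (
  select start_date + level - 1
  from (
    select min(start_date) as start_date, max(end_date) as end_date
    from big_base_table
  )
  connect by level <= end_date - start_date + 1
)
select bbt.id, min(bbt.type_rank) as type_rank, to_char(ad.day, 'YYYY-MM-DD') as day
from all_dates ad
join big_base_table bbt
on bbt.start_date <= ad.day
and bbt.end_date >= ad.day
group by bbt.id, ad.day;

-- then display the estimated plan
select * from table(dbms_xplan.display);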

Alex Poole
  • Unfortunately this seems to be much slower, even with a static calendar table. (I stopped the query at 2 hours.) – Caitlin M. Shaw Mar 28 '17 at 20:57
  • On the same small dataset that took 20 minutes before? Out of interest, how many rows are there, how long does it take just to find the min/max dates (i.e. running the CTE's query on its own), and how big is that overall date range? The execution plan for this would be interesting too. – Alex Poole Mar 28 '17 at 21:57
  • Yeah, the same one. It's got ~56 million rows and it covers seven years from 2010 to 2016. With the calendar materialized and indexed, the explain plan (oops, edited here) is too long for this comment, but does a full table access on big_base_table, sort join, filter, merge join to the calendar table which thinks it's going to have 3365M rows (it will not!), then a hash group by. I tried materializing the calendar+base join without the group by and it's only ~58 million rows. – Caitlin M. Shaw Mar 29 '17 at 15:42