3

say you have a table of customers with dates as follows:
[customer_table]

+----------+-----------+----------+
| customer | date      | purchase |
+----------+-----------+----------+
| 1        | 1/01/2016 | 12       |
+----------+-----------+----------+
| 1        | 1/12/2016 | 3        |
+----------+-----------+----------+
| 2        | 5/03/2016 | 5        |
+----------+-----------+----------+
| 3        | 1/16/2016 | 6        |
+----------+-----------+----------+
| 3        | 3/22/2016 | 1        |
+----------+-----------+----------+  

I want to write a query to count how many distinct customers made a purchase in the last 10 days as a rolling period, starting from each calendar day and counting 10 days backwards. So for each unique day in 2016 the final output would be a calendar, where each day has a count of distinct customers that exist in the prior 10 days of that day of the calendar like the following:
[result_table]

+-----------+------------------+
| date      | unique customers |
+-----------+------------------+
| 1/01/2016 | 112              |
+-----------+------------------+
| 1/02/2016 | 104              |
+-----------+------------------+
| 1/03/2016 | 140              |
+-----------+------------------+
| 1/04/2016 | 133              |
+-----------+------------------+
| ....      | 121              |
+-----------+------------------+  

One solution I came up with was to create a calendar table that is one single column, then join the calendar table to the customer table with an inequality join. I believe this is extremely inefficient and am seeking a faster solution. So my first step was create a calendar like so:
[calendar]

+-----------+
| date      |
+-----------+
| 1/01/2016 |
+-----------+
| 1/02/2016 |
+-----------+
| 1/03/2016 |
+-----------+
| 1/04/2016 |
+-----------+
| 1/05/2016 |
+-----------+  

Then for each day in that calendar, to count the distinct set of customers prior to each day, I join with an inequality like so:

select
count(distinct customer) as unique customers
from calendar c
left join mytable m
on c.date>=m.date and m.date>=c.date-10  

While I believe this is correct, it runs extremely slow (say for 2 years of a calendar with a few million customers). Is there an oracle analytic function that may help me out here?

barker
  • 1,005
  • 18
  • 36

1 Answers1

4

Is there an oracle analytic function that may help me out here?

Not really - from the COUNT() documentation:

If you specify DISTINCT, then you can specify only the query_partition_clause of the analytic_clause. The order_by_clause and windowing_clause are not allowed.

You would want both DISTINCT and a windowing_clause which isn't allowed.

Update:

You can get the same effect as the invalid syntax using a combination of a non-DISTINCT analytic query partitioned by customer and then aggregation by day:

Oracle Setup:

CREATE TABLE table_name ( customer, dt ) AS
  SELECT 1, DATE '2017-01-10' FROM DUAL UNION ALL
  SELECT 1, DATE '2017-01-11' FROM DUAL UNION ALL
  SELECT 1, DATE '2017-01-15' FROM DUAL UNION ALL
  SELECT 1, DATE '2017-01-20' FROM DUAL UNION ALL
  SELECT 2, DATE '2017-01-12' FROM DUAL UNION ALL
  SELECT 2, DATE '2017-01-19' FROM DUAL UNION ALL
  SELECT 3, DATE '2017-01-10' FROM DUAL UNION ALL
  SELECT 3, DATE '2017-01-13' FROM DUAL UNION ALL
  SELECT 3, DATE '2017-01-15' FROM DUAL UNION ALL
  SELECT 3, DATE '2017-01-20' FROM DUAL;

Query:

Note: the query below is only for one month's data and for a range of two days preceding to illustrate the principle but it is easy enough to change the parameters to 12 months and 10 days.

SELECT day,
       SUM( has_order_in_range ) AS unique_customers
FROM   (
  SELECT customer,
         day,
         LEAST(
           1,
           COUNT(dt) OVER ( PARTITION BY customer
                            ORDER BY day
                            RANGE BETWEEN INTERVAL '2' DAY PRECEDING
                                      AND INTERVAL '0' DAY FOLLOWING )
         ) AS has_order_in_range
  FROM   table_name t
         PARTITION BY ( customer )
         RIGHT OUTER JOIN
         ( -- Create a calendar for one month
           SELECT DATE '2017-01-01' + LEVEL - 1 AS day
           FROM   DUAL
           CONNECT BY DATE '2017-01-01' + LEVEL - 1 < ADD_MONTHS( DATE '2017-01-01', 1 )
         ) d
         ON ( t.dt = d.day )
)
GROUP BY day
ORDER BY day;

Output:

DAY                 UNIQUE_CUSTOMERS
------------------- ----------------
2017-01-01 00:00:00                0
2017-01-02 00:00:00                0
2017-01-03 00:00:00                0
2017-01-04 00:00:00                0
2017-01-05 00:00:00                0
2017-01-06 00:00:00                0
2017-01-07 00:00:00                0
2017-01-08 00:00:00                0
2017-01-09 00:00:00                0
2017-01-10 00:00:00                2
2017-01-11 00:00:00                2
2017-01-12 00:00:00                3
2017-01-13 00:00:00                3
2017-01-14 00:00:00                2
2017-01-15 00:00:00                2
2017-01-16 00:00:00                2
2017-01-17 00:00:00                2
2017-01-18 00:00:00                0
2017-01-19 00:00:00                1
2017-01-20 00:00:00                3
2017-01-21 00:00:00                3
2017-01-22 00:00:00                2
2017-01-23 00:00:00                0
2017-01-24 00:00:00                0
2017-01-25 00:00:00                0
2017-01-26 00:00:00                0
2017-01-27 00:00:00                0
2017-01-28 00:00:00                0
2017-01-29 00:00:00                0
2017-01-30 00:00:00                0
2017-01-31 00:00:00                0
MT0
  • 143,790
  • 11
  • 59
  • 117
  • so you think the only way is a triangular join / cross join? – barker Feb 16 '17 at 19:15
  • @barker Found a way with a combination of analytic and aggregate functions and a partition join... no Idea on how it will perform compared to an outer join on a range condition so you'll need to profile both querys and compare. – MT0 Feb 17 '17 at 11:40
  • I ran out of TEMP trying this query. But it's clever and nice, though. – peter.hrasko.sk Feb 17 '17 at 12:45
  • ah this might be the answer, i'll have to test to see which is faster in production. So the "least" function is I believe what makes this correct, as the preceding function will count the same customer multiple times, least will force the value to 1 or 0 correct? therefore the sum would be distinct. – barker Feb 17 '17 at 17:27
  • @barker That is correct, the combination of the `COUNT() OVER (...)` and `LEAST` is used to determine if there is a customer purchase in the range and then the `SUM` over those values makes it like `COUNT( DISTINCT ... ) OVER ( window )`. – MT0 Feb 17 '17 at 18:20
  • i've been looking all over the internet for the past week and could not find the solution to this problem anywhere, besides using triangular joins. i feel like more people need to know about this :) thanks! – barker Feb 19 '17 at 01:39
  • Nice work. This is actually very similar to the synopsis technique used to support incremental stats :) – BobC Feb 24 '17 at 01:22