
I have a large dataset and want to retrieve previous and subsequent observations for small windows. Below is a snippet around a window of interest (2024-09-23 to 2024-09-27).

item  date        price
A     2024-09-20  $106.59
A     2024-09-23  $91.84
A     2024-09-24  $92.66
A     2024-09-25  $107.87
A     2024-09-26  $98.47
A     2024-09-27  $91.65
A     2024-09-30  $108.10
B     2024-07-29  $120.71
B     2024-08-13  $87.80
B     2024-09-04  $115.80
B     2024-09-26  $94.06
B     2024-09-27  $120.88

In the example above, "A" had a price every weekday, but "B" had infrequent prices. As expected (but not desired), the code below produces NULL values at the beginning and end of the desired window (specified in the WHERE clause).

SELECT item, date, price,
    lag(price) OVER (PARTITION BY item ORDER BY date) as prevprice,
    lead(price) OVER (PARTITION BY item ORDER BY date) as nextprice
FROM dailytable
WHERE date BETWEEN date'2024-09-23' AND date'2024-09-27'
;

Results - Not Wanted

item  date        price    prevprice  nextprice
A     2024-09-23  $91.84   NULL       $92.66
A     2024-09-24  $92.66   $91.84     $107.87
A     2024-09-25  $107.87  $92.66     $98.47
A     2024-09-26  $98.47   $107.87    $91.65
A     2024-09-27  $91.65   $98.47     NULL
B     2024-09-26  $94.06   NULL       $120.88
B     2024-09-27  $120.88  $94.06     NULL

Obviously, I could run LEAD() and LAG() over the entire dataset in a subquery and then apply the WHERE clause to the result, as below.

SELECT * 
FROM 
(   SELECT item, date, price,
        lag(price) OVER (PARTITION BY item ORDER BY date) as prevprice,
        lead(price) OVER (PARTITION BY item ORDER BY date) as nextprice
    FROM dailytable
)
WHERE date BETWEEN date'2024-09-23' AND date'2024-09-27'
;

Desired Results

item  date        price    prevprice  nextprice
A     2024-09-23  $91.84   $106.59    $92.66
A     2024-09-24  $92.66   $91.84     $107.87
A     2024-09-25  $107.87  $92.66     $98.47
A     2024-09-26  $98.47   $107.87    $91.65
A     2024-09-27  $91.65   $98.47     $108.10
B     2024-09-26  $94.06   $115.80    $120.88
B     2024-09-27  $120.88  $94.06     NULL

However, this is very costly and takes a long time. The entire dataset spans more than 50 years, but the required lead and lag periods are nowhere near that long.

Is there a way to get the previous and subsequent prices for the window without a subquery that scans the entire history of the dataset? Of course, if an item doesn't have a next price, like B in the example above, I don't expect magic; NULL is appropriate.

Misha
  • Can you define a smaller window where you must expect to have prior/previous data? Filter on that. You might also be able to use a scalar subquery to defer the lookup to the rare cases where it's needed? – shawnt00 Aug 14 '23 at 21:25

1 Answer


I suggest using a CASE expression that runs a correlated subquery when prevprice is NULL; the subquery looks up the immediately prior price, ignoring the overall date-range filter. The same pattern, mirrored, fills a NULL nextprice. For example:

SELECT
      item
    , date
    , price
    , CASE
        WHEN prevprice IS NULL
            THEN (
                    -- look up the latest price strictly before this row's date,
                    -- ignoring the window filter applied in the derived table
                    SELECT t.price
                    FROM dailytable AS t
                    WHERE t.item = subquery.item
                      AND t.date < subquery.date
                    ORDER BY t.date DESC
                    LIMIT 1
                 )
        ELSE prevprice
      END AS prevprice
    , CASE
        WHEN nextprice IS NULL
            THEN (
                    -- mirror image: the earliest price strictly after this row's date
                    SELECT t.price
                    FROM dailytable AS t
                    WHERE t.item = subquery.item
                      AND t.date > subquery.date
                    ORDER BY t.date ASC
                    LIMIT 1
                 )
        ELSE nextprice
      END AS nextprice
FROM (
    SELECT
          item
        , date
        , price
        , lag(price) OVER (PARTITION BY item ORDER BY date) AS prevprice
        , lead(price) OVER (PARTITION BY item ORDER BY date) AS nextprice
    FROM dailytable
    WHERE date BETWEEN DATE '2024-09-23'
                   AND DATE '2024-09-27'
    ) subquery
;
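
If Trino accepts the correlated lookups, the first and last rows of each item's window pick up their out-of-window neighbours ($106.59 for A on 2024-09-23, $115.80 for B on 2024-09-26, and so on), matching the desired results above; B's final nextprice stays NULL because no later row exists.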

N.B. Trino does support LIMIT on an ordered subquery, but I cannot test the query above, and whether the planner accepts this particular correlated form may depend on the version. Also, if Trino supported lateral joins (or an APPLY operator) I would prefer those, but as I understand it, neither is available as of this writing.
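
For what it's worth, Trino's grammar also includes the ANSI FETCH FIRST ... ROWS ONLY as an equivalent of LIMIT, so the prevprice lookup could equally be phrased as below; this is still an untested sketch subject to the same caveat about correlated subqueries:

SELECT t.price
FROM dailytable AS t
WHERE t.item = subquery.item
  AND t.date < subquery.date
ORDER BY t.date DESC
FETCH FIRST 1 ROWS ONLY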

It is noteworthy that correlated subqueries aren't wonderful for performance, but I suspect this will be better than scanning a table with 50 years of data.
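
If a hard upper bound on the gap between observations is known, another option, along the lines of shawnt00's comment, is to pad the scan window instead of scanning the full history. A minimal sketch, assuming a 90-day pad always captures at least one row on each side (the pad length is my assumption, not something stated in the question):

SELECT *
FROM (
    SELECT item, date, price,
        lag(price) OVER (PARTITION BY item ORDER BY date) AS prevprice,
        lead(price) OVER (PARTITION BY item ORDER BY date) AS nextprice
    FROM dailytable
    -- scan only a padded window, not all 50 years
    WHERE date BETWEEN DATE '2024-09-23' - INTERVAL '90' DAY
                   AND DATE '2024-09-27' + INTERVAL '90' DAY
    ) padded
-- then keep only the window of interest
WHERE date BETWEEN DATE '2024-09-23' AND DATE '2024-09-27'
;

The trade-off: if an item has no observation inside the pad on one side, prevprice or nextprice comes back NULL even though an older or newer row exists further out, whereas the correlated-subquery version above has no such failure mode.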

Paul Maxwell