
I have a database table containing one-minute periods of Open, Close, High, Low, Volume values for a security. I'm using SQL Server 2017, but 2019 RC is an option.

I am trying to find an efficient SQL Server query that can aggregate these into 5-minute windows, where:

  • Open = first Open value of the window
  • Close = last Close value of the window
  • High = max High value of the window
  • Low = min Low value of the window
  • Volume = avg Volume across the window

Ideally this query would account for gaps in the data, i.e. be based on date calculations rather than counting preceding / following rows.

For example, say I have the following (6 minutes of data):

| Time             | Open | Close | High | Low | Volume |
|------------------|------|-------|------|-----|--------|
| 2019-10-30 09:30 | 5    | 10    | 15   | 1   | 125000 |
| 2019-10-30 09:31 | 10   | 15    | 20   | 5   | 100000 |
| 2019-10-30 09:32 | 15   | 20    | 25   | 10  | 120000 |
| 2019-10-30 09:33 | 20   | 25    | 30   | 15  | 10000  |
| 2019-10-30 09:34 | 20   | 22    | 40   | 2   | 13122  |
| 2019-10-30 09:35 | 22   | 30    | 35   | 4   | 15000  |

(The 09:35 row is not factored in, since it would be the first row of the next 5-minute window.)

I am trying to write a query that would give me this (the first 5-minute aggregate):

| Time             | Open | Close | High | Low | Volume  |
|------------------|------|-------|------|-----|---------|
| 2019-10-30 09:30 | 5    | 22    | 40   | 1   | 73624.4 |

Any tips? I'm banging my head against the wall with the OVER clause and its PARTITION / RANGE options.
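For reference, here is a minimal sketch of the table shape; the names and types are just illustrative:

CREATE TABLE t (
    [Time]   datetime      NOT NULL,  -- one row per one-minute period
    [Open]   decimal(18,4) NOT NULL,
    [Close]  decimal(18,4) NOT NULL,
    [High]   decimal(18,4) NOT NULL,
    [Low]    decimal(18,4) NOT NULL,
    [Volume] int           NOT NULL
);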

Alex Norcliffe

3 Answers


The gist of the problem is rounding datetime values to a 5-minute boundary, which (assuming the datatype is datetime) can be done using DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [Time]) / 5 * 5, 0). The rest is basic grouping/window functions:

WITH cte AS (
  SELECT clamped_time
       , [Open]
       , [Close]
       , [High]
       , [Low]
       , [Volume]
       -- rn1 = 1 on the first row of each 5-minute window, rn2 = 1 on the last
       , rn1 = ROW_NUMBER() OVER (PARTITION BY clamped_time ORDER BY [Time])
       , rn2 = ROW_NUMBER() OVER (PARTITION BY clamped_time ORDER BY [Time] DESC)
  FROM t
  CROSS APPLY (
      -- floor [Time] to its 5-minute boundary
      SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [Time]) / 5 * 5, 0)
  ) AS x(clamped_time)
)
SELECT clamped_time
     , MIN(CASE WHEN rn1 = 1 THEN [Open] END) AS [Open]
     , MIN(CASE WHEN rn2 = 1 THEN [Close] END) AS [Close]
     , MAX([High]) AS [High]
     , MIN([Low]) AS [Low]
     , AVG(1.0 * [Volume]) AS [Volume] -- 1.0 * keeps the average from being truncated to an integer
FROM cte
GROUP BY clamped_time

Demo on db<>fiddle
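
To see the boundary calculation on its own, here is a quick check using one of the sample timestamps from the question (assuming a datetime input, as noted above):

SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, 0, CAST('2019-10-30 09:33' AS datetime)) / 5 * 5, 0);
-- 2019-10-30 09:30:00.000 (the integer division discards the 3 extra minutes)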

Salman A
  • Thank you so much! It's so fast!! I had gotten a version working from a 2015 post (http://www.sqlfingers.com/2015/06/querypenhighlowclose-tick-price-data.html), but it took 3 seconds across a few thousand rows for one security. Your solution generates equivalent values in 2 seconds across 500k rows (or 100ms per security, which is how it will be run). Thanks! – Alex Norcliffe Nov 04 '19 at 08:46
  • Out of curiosity, what was the reasoning for editing it to include the rn1/rn2 for calculating the Open/Close? So that I can understand. Instead of the CASE statements I tried shortcutting with LAST_VALUE([Close]) OVER (PARTITION BY clamped_time ORDER BY t.PeriodOpen DESC), but I'm assuming you had a reason for not doing that? – Alex Norcliffe Nov 04 '19 at 08:49
  • @user12319070 it is much simpler with row_number. With `LAST_VALUE` you need a `ROWS BETWEEN` clause, which is something I am not familiar with. But you're welcome to use it if it creates a better plan. Re speed: if you want it faster I suggest creating a persisted, computed column to store the 5-minute boundary and creating an index on `(x_time, time) include (open, close, ...)`. – Salman A Nov 04 '19 at 09:49
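
For reference, a sketch of the persisted computed column and index suggested in the last comment; the table, column, and index names here are illustrative:

-- Illustrative names; assumes the table and columns used in the answer above
ALTER TABLE t ADD clamped_time AS
    DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [Time]) / 5 * 5, 0) PERSISTED;

CREATE INDEX IX_t_clamped_time
    ON t (clamped_time, [Time])
    INCLUDE ([Open], [Close], [High], [Low], [Volume]);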

You want to analyze the data in 5-minute intervals. You could use window functions with the following partitioning clause:

partition by datepart(year, t.[time]),
    datepart(month, t.[time]),
    datepart(day, t.[time]),
    datepart(hour, t.[time]),
    (datepart(minute, t.[time]) / 5)
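
As a quick check of the last component, every minute from 09:30 through 09:34 lands in the same bucket (this assumes a datetime input, as in the question):

SELECT DATEPART(MINUTE, CAST('2019-10-30 09:33' AS datetime)) / 5; -- 6, the same bucket as 09:30..09:34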

Query:

select *
from (
    select  
        t.time,
        row_number() over(
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
            order by [time]
        ) [rn],
        first_value([open]) over(
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
            order by [time]
        ) [open],
        last_value([close]) over(
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
            order by [time]
            -- an explicit frame is needed here: the default frame stops at the current row
            rows between unbounded preceding and unbounded following
        ) [close],
        max([high]) over (
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
        ) [high],
        min([low]) over (
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
        ) [low],
        avg(1.0 * [volume]) over ( -- 1.0 * keeps the average from being truncated to an integer
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
        ) [volume]
    from mytable t
) t
where rn = 1
GMB
  • Thank you very much! I tried your version too, and it definitely helped me understand partitioning logic for the over clause. It also generates fine results, though the accepted answer appears to be slightly faster to execute and is shorter. Tough choice tho! – Alex Norcliffe Nov 04 '19 at 08:48
  • 1
    I think that you made the good choice, I find that the solution by @SalmanA is actuallly better than mine, because the definition of the partition is more concise. – GMB Nov 04 '19 at 13:51

You can try this:

  SELECT
      MIN([Time]) [Time],
      MIN([Open]) [Open],  -- the smallest Open, which equals the first Open only when opens rise within the window
      -- the next window's first Open stands in for this window's Close;
      -- that matches the true Close only when the data is contiguous
      LEAD(MIN([Open])) OVER (ORDER BY MIN([Time])) AS [Close],
      MAX([High]) [High],
      MIN([Low]) [Low],
      AVG(Volume) Volume
  FROM SampleData
  -- subtract (minute % 5) minutes to snap each row to its 5-minute boundary
  GROUP BY DATEADD(MINUTE, -1 * DATEPART(MINUTE, [Time]) % 5, [Time])

sql fiddle

junior_dev