2

I'm looking for an efficient way on how to query the n past values as array in ClickHouse for each row ordered by one column (i.e. Time), where the values should be retrieved as array.

Window functions are still not supported in ClickHouse (see #1469), so I was hoping for a work-around using aggregation functions like groupArray()?

Table:

Time  | Value
12:11 | 1
12:12 | 2
12:13 | 3
12:14 | 4
12:15 | 5
12:16 | 6

Expected result with a window of size n=3:

Time  | Value
12:13 | [1,2,3]
12:14 | [2,3,4]
12:15 | [3,4,5]
12:16 | [4,5,6]

What are the ways/functions currently used in ClickHouse to efficiently query a sliding/moving window and how can I achieve my desired result?

EDIT:

My solution based on response of @vladimir:

select max(Time) as Time, groupArray(Value) as Values
from (
    select
        *,
        rowNumberInAllBlocks() as row_number,
        arrayJoin(range(row_number, row_number + 3)) as window_id
    from (
        /* BEGIN emulate origin dataset */
        select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
        from (
            select arrayJoin([
                '2020-01-01 12:11:00',
                '2020-01-01 12:12:00',
                '2020-01-01 12:13:00',
                '2020-01-01 12:14:00',
                '2020-01-01 12:15:00',
                '2020-01-01 12:16:00']) a
        )
        order by Time
        /* END emulate origin dataset */
    )
    order by Time
) s
group by window_id
having length(Values) = 3
order by Time

Note that 3 appears twice in the query and represents the window size n.

Output:

┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
Vadim Kotov
  • 8,084
  • 8
  • 48
  • 62
1RB
  • 23
  • 1
  • 4

2 Answers2

2

Starting from version 21.4 added the full support of window-functions. At this moment it was marked as an experimental feature.

SELECT
    Time,
    groupArray(any(Value)) OVER (ORDER BY Time ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Values
FROM 
(
    /* Emulate the test dataset, */
    select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
    from (
        select arrayJoin([
            '2020-01-01 12:11:00',
            '2020-01-01 12:12:00',
            '2020-01-01 12:13:00',
            '2020-01-01 12:14:00',
            '2020-01-01 12:15:00',
            '2020-01-01 12:16:00']) a
    )
    order by Time
)
GROUP BY Time
SETTINGS allow_experimental_window_functions = 1

/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1]     │
│ 2020-01-01 12:12:00 │ [1,2]   │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/

See https://altinity.com/blog/clickhouse-window-functions-current-state-of-the-art.


ClickHouse has several datablock-scoped window functions, let's take neighbor:

SELECT Time, [neighbor(Value, -2), neighbor(Value, -1), neighbor(Value, 0)] Values
FROM (
  /* emulate origin data */
  SELECT toDateTime(data.1) as Time, data.2 as Value
  FROM (
    SELECT arrayJoin([('2020-01-01 12:11:00', 1),
    ('2020-01-01 12:12:00', 2),
    ('2020-01-01 12:13:00', 3),
    ('2020-01-01 12:14:00', 4),
    ('2020-01-01 12:15:00', 5),
    ('2020-01-01 12:16:00', 6)]) as data)
  )

/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [0,0,1] │
│ 2020-01-01 12:12:00 │ [0,1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘

*/

An alternate way based on the duplication of source rows by window_size times:

SELECT   
  arrayReduce('max', arrayMap(x -> x.1, raw_result)) Time,
  arrayMap(x -> x.2, raw_result) Values
FROM (  
  SELECT groupArray((Time, Value)) raw_result, max(row_number) max_row_number
  FROM (
    SELECT 
      3 AS window_size,
      *, 
      rowNumberInAllBlocks() row_number,
      arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
    FROM (
      /* emulate origin dataset */
      SELECT toDateTime(data.1) as Time, data.2 as Value
      FROM (
        SELECT arrayJoin([('2020-01-01 12:11:00', 1),
          ('2020-01-01 12:12:00', 2),
          ('2020-01-01 12:13:00', 3),
          ('2020-01-01 12:14:00', 4),
          ('2020-01-01 12:15:00', 5),
          ('2020-01-01 12:16:00', 6)]) as data)
      ORDER BY Value
      )
    )
  GROUP BY window_id
  HAVING max_row_number = window_id
  ORDER BY window_id
  )
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1]     │
│ 2020-01-01 12:12:00 │ [1,2]   │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/

Extra example:

SELECT   
  arrayReduce('max', arrayMap(x -> x.1, raw_result)) id,
  arrayMap(x -> x.2, raw_result) values
FROM (  
  SELECT groupArray((id, value)) raw_result, max(row_number) max_row_number
  FROM (
    SELECT 
      48 AS window_size,
      *, 
      rowNumberInAllBlocks() row_number,
      arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
    FROM (
      /* the origin dataset */
      SELECT number AS id, number AS value
      FROM numbers(4096) 
      )
    )
  GROUP BY window_id
  HAVING max_row_number = window_id
  ORDER BY window_id
  )
/*
┌─id─┬─values────────────────┐
│  0 │ [0]                   │
│  1 │ [0,1]                 │
│  2 │ [0,1,2]               │
│  3 │ [0,1,2,3]             │
│  4 │ [0,1,2,3,4]           │
│  5 │ [0,1,2,3,4,5]         │
│  6 │ [0,1,2,3,4,5,6]       │
│  7 │ [0,1,2,3,4,5,6,7]     │
..
│ 56 │ [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56]  │
│ 57 │ [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57] │
│ 58 │ [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58] │
│ 59 │ [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59] │
│ 60 │ [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60] │
..
│ 4093 │ [4046,4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093] │
│ 4094 │ [4047,4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094] │
│ 4095 │ [4048,4049,4050,4051,4052,4053,4054,4055,4056,4057,4058,4059,4060,4061,4062,4063,4064,4065,4066,4067,4068,4069,4070,4071,4072,4073,4074,4075,4076,4077,4078,4079,4080,4081,4082,4083,4084,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095] │
└──────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
*/
vladimir
  • 13,428
  • 2
  • 44
  • 70
  • this is indeed a possible solution for small window sizes, but is there a generic neighbor function which returns an array of values until the specified offset, instead of only the value at offset? – 1RB Nov 08 '20 at 00:18
  • you are right, I extended the answer with a more flexible way (it allows using any window size) but a little confusing. – vladimir Nov 08 '20 at 02:17
  • Thanks for the "creative" solutions. This is definitely something ClickHouse should provide as a built-in feature--sliding windows for time series is a pretty basic requirement IMO. I have modified your solution, see my post. – 1RB Nov 08 '20 at 03:14
0

For Clickhouse 19, where range function takes only single input, you can use following query

select max(Time) as Time, groupArray(Value) as Values
from (
select
        *,
        rowNumberInAllBlocks() as row_number,
        arrayJoin( arrayMap(x -> x + row_number, range(3)) ) as window_id
    from (
        /* BEGIN emulate origin dataset */
        select toDateTime(a) as Time, rowNumberInAllBlocks()+1 as Value
        from (
            select arrayJoin([
                '2020-01-01 12:11:00',
                '2020-01-01 12:12:00',
                '2020-01-01 12:13:00',
                '2020-01-01 12:14:00',
                '2020-01-01 12:15:00',
                '2020-01-01 12:16:00']) a
        )
        order by Time
        /* END emulate origin dataset */
    )
order by Time
) s
group by window_id
having length(Values) = 3
order by Time
Flair
  • 2,609
  • 1
  • 29
  • 41
Nab
  • 1