4

I have this table, and I want to add the values of 'change' column for several rows (Or, to be more exact from the row in which 'ne' value is zero up to the next row which includes zero for 'ne' (not the second one itself)). Any answer would be appreciated.

┌─rn─┬───────date─┬─ne─┬───────change─┐
│  0 │ 2008-12-07 │  0 │ -10330848398 │
│  1 │ 2009-04-14 │  1 │       -61290 │
│  2 │ 2009-04-26 │  1 │   9605743360 │
│  3 │ 2013-07-06 │  0 │ -32028871920 │
│  4 │ 2014-01-12 │  1 │ -42296164902 │
│  5 │ 2015-06-08 │  1 │  59100383646 │
└────┴────────────┴────┴──────────────┘

The result we expect is something like this.

row    start        end         sum(change) 
--------------------------------------------------
0 | 2008-12-07 | 2009-04-26 | -725,166,328
--------------------------------------------------
1 | 2013-07-06 | 2015-06-08 | -15,224,653,176
--------------------------------------------------
GMB
  • 216,147
  • 25
  • 84
  • 135
Commander
  • 311
  • 3
  • 16
  • Do you want to insert rows or you want to update? – Trevor Apr 11 '20 at 20:34
  • neither insert nor update. I just want to add the values of 'change' column between each zero value of 'ne' column. I need something like result of a select statement. @Trevor – Commander Apr 11 '20 at 20:48
  • 2
    Please show us the results that you expect, as tabular text. – GMB Apr 11 '20 at 21:02
  • 2
    I edited my question, now you can see what we expect. @GMB – Commander Apr 11 '20 at 21:10
  • Is this something that has to be done in SQL? I would probably use procedural code to do this if that was an option at all. – Peter Dongan Apr 11 '20 at 21:13
  • 1
    Our priority is to solve this challenge in clickhouse. but if you have any solution in sql, it could be informative. @NeutralHandle – Commander Apr 11 '20 at 21:16
  • You can begin by SELECT ne, MIN(date), MAX(date), SUM(change) GROUP BY ne – Trevor Apr 11 '20 at 21:18
  • 1
    This is normally handled with [window functions](https://mode.com/sql-tutorial/sql-window-functions/), but clickhouse seems to be lacking them. – Schwern Apr 11 '20 at 21:19
  • Apparently they are supported experimentally behind a flag. Versions from 2021 onwards are supposed to support window functions without the flag. – Milind R Jun 17 '22 at 07:17

5 Answers5

7

This is a gaps and islands problem. Canonical solutions do use window functions, which as far as concerns, Clickhouse does not support.

Here is one approach that uses a subquery to emulate a conditional window sum:

select
    min(date) start_date,
    max(date) end_date,
    sum(change) sum_change
from (
    select 
        t.*,
        (select count(*) from mytable t1 where t1.date <= t.date and t1.ne = 0) grp
    from mytable t
) t
group by grp

The subquery counts how many rows have ne = 0 from the first row of the table until the current row. This defines groups of records. Then all that is left to do is aggregate.

If you could use window functions, you would phrase this as:

select
    min(date) start_date,
    max(date) end_date,
    sum(change) sum_change
from (
    select 
        t.*,
        sum(case when ne = 0 then 1 else 0 end) over(order by date) grp
    from mytable t
) t
group by grp
GMB
  • 216,147
  • 25
  • 84
  • 135
  • Apparently this uses correlated subqueries which Clickhouse (at least 20.x versions) don't support (https://github.com/ClickHouse/ClickHouse/issues/25319). – Milind R Jun 15 '22 at 09:02
4

it's unsolvable in bigdata ( > 100 millions rows)

SELECT
    d[1] AS s,
    d[-1] AS e,
    arraySum(c) AS sm
FROM
(
    SELECT
        arraySplit((x, y) -> (NOT y), d, n) AS dd,
        arraySplit((x, y) -> (NOT y), c, n) AS cc
    FROM
    (
        SELECT
            groupArray(date) AS d,
            groupArray(ne) AS n,
            groupArray(change) AS c
        FROM
        (
            SELECT *
            FROM mytable
            ORDER BY rn ASC
        )
    )
)
ARRAY JOIN
    dd AS d,
    cc AS c

┌─s──────────┬─e──────────┬───────────sm─┐
│ 2008-12-07 │ 2009-04-26 │   -725166328 │
│ 2013-07-06 │ 2015-06-08 │ -15224653176 │
└────────────┴────────────┴──────────────┘
Denny Crane
  • 11,574
  • 2
  • 19
  • 30
3

Just another way to resolve this task:

WITH (SELECT arraySort(groupArray(rn))
    FROM test_table
    WHERE ne = 0) as group_start_id
SELECT argMin(date, rn) start, argMax(date, rn) end, sum(change)
FROM (
    SELECT rn, date, change
    FROM test_table
    ORDER BY rn)
GROUP BY arrayFirstIndex(x -> rn < x, group_start_id)   
ORDER BY start

Testing on sample data:

WITH (SELECT arraySort(groupArray(rn))
    FROM (
        SELECT data.1 rn, data.2 date, data.3 ne, data.4 change
        FROM (
            SELECT arrayJoin([
            (0, toDate('2008-12-07'), 0, toInt64(-10330848398)),
            (1, toDate('2009-04-14'), 1, toInt64(-61290)),
            (2, toDate('2009-04-26'), 1, toInt64(9605743360)),
            (3, toDate('2013-07-06'), 0, toInt64(-32028871920)),
            (4, toDate('2014-01-12'), 1, toInt64(-42296164902)),
            (5, toDate('2015-06-08'), 1, toInt64(59100383646)),
            (6, toDate('2015-06-08'), 0, toInt64(101)),
            (7, toDate('2015-06-09'), 0, toInt64(102)),
            (8, toDate('2015-06-10'), 0, toInt64(103)),
            (9, toDate('2015-06-11'), 1, toInt64(104))
            ]) data))
    WHERE ne = 0) as group_start_id
SELECT argMin(date, rn) start, argMax(date, rn) end, sum(change)
FROM (
    SELECT data.1 rn, data.2 date, data.4 change
    FROM (
        SELECT arrayJoin([
        (0, toDate('2008-12-07'), 0, toInt64(-10330848398)),
        (1, toDate('2009-04-14'), 1, toInt64(-61290)),
        (2, toDate('2009-04-26'), 1, toInt64(9605743360)),
        (3, toDate('2013-07-06'), 0, toInt64(-32028871920)),
        (4, toDate('2014-01-12'), 1, toInt64(-42296164902)),
        (5, toDate('2015-06-08'), 1, toInt64(59100383646)),
        (6, toDate('2015-06-08'), 0, toInt64(101)),
        (7, toDate('2015-06-09'), 0, toInt64(102)),
        (8, toDate('2015-06-10'), 0, toInt64(103)),
        (9, toDate('2015-06-11'), 1, toInt64(104))
        ]) data)
    ORDER BY rn)
GROUP BY arrayFirstIndex(x -> rn < x, group_start_id)   
ORDER BY start
/* result
┌──────start─┬────────end─┬──sum(change)─┐
│ 2008-12-07 │ 2009-04-26 │   -725166328 │
│ 2013-07-06 │ 2015-06-08 │ -15224653176 │
│ 2015-06-08 │ 2015-06-08 │          101 │
│ 2015-06-09 │ 2015-06-09 │          102 │
│ 2015-06-10 │ 2015-06-11 │          207 │
└────────────┴────────────┴──────────────┘
*/
vladimir
  • 13,428
  • 2
  • 44
  • 70
0

SELECT ne, MIN(date) as start, MAX(date) as end, SUM(change) as change GROUP BY ne

Trevor
  • 177
  • 5
0

Assuming Clickhouse supports variables:

set @block := -1;
select 
    block as row,
    min(date) as start,
    max(date) as end,
    sum(change)
from
    (select  
        case when ne = 0 then @block:=@block+1 end as dummy,
        @block as block,
        t.*
    from t) tt
group by block;
triclosan
  • 5,578
  • 6
  • 26
  • 50