4

I have array of events. Each event represented as tuple and contain session_id and datetime:

[
 ('aa', '2020-11-08 01:00:01'),
 ('aa', '2020-11-08 01:00:03'),
 ('aa', '2020-11-08 01:00:05'),
 ('ab', '2020-11-09 01:00:00'),
 ('ab', '2020-11-09 01:00:05'),
 ('ab', '2020-11-09 01:00:15')
]

I need to calculate average session time using this array.

So I need to convert this array to new array [(session_id, min(datetime), max(datetime))]

[
 ('aa', '2020-11-08 01:00:01', '2020-11-08 01:00:05'),
 ('ab', '2020-11-09 01:00:00', '2020-11-09 01:00:15')
]

Then calculate session_time for each session_id [(session_id, max(datetime) - min(datetime))]

[
 ('aa', 4),
 ('ab', 15)
]

And then calculate average session time ((4+15)/2) = 9.5

What is the best way to do that?

nikopol
  • 41
  • 3

2 Answers2

1

To get the desired result I would use relation representation of data instead of the array.

arrayJoin helps transform array to a relation:

SELECT avg(duration)
FROM 
(
    SELECT max(time) - min(time) AS duration
    FROM 
    (
        SELECT 
            data.1 AS id,
            toDateTime(data.2) AS time
        FROM 
        (
            SELECT arrayJoin([('aa', '2020-11-08 01:00:01'), ('aa', '2020-11-08 01:00:03'), ('aa', '2020-11-08 01:00:05'), ('ab', '2020-11-09 01:00:00'), ('ab', '2020-11-09 01:00:05'), ('ab', '2020-11-09 01:00:15')]) AS data
        )
    )
    GROUP BY id
)
/*
┌─avg(duration)─┐
│           9.5 │
└───────────────┘
*/

Array-based decision. Take into account it can be significantly slower than a relation-based one (check both of them before choosing the best one). This implementation can be improved by using arrayReduceInRanges-function.

SELECT 
    arraySort(x -> (x.1), data) AS sorted_array,
    arraySplit((x, y) -> y, sorted_array, arrayMap((x, i) -> if(i = 1, 1, if((x.1) = ((sorted_array[i - 1]).1), 0, 1)), sorted_array, arrayEnumerate(sorted_array))) AS session_arrays,
    arrayMap(arr -> arrayReduce('min', arrayMap(x -> (x.2), arr)), session_arrays) AS min_session_times,
    arrayMap(arr -> arrayReduce('max', arrayMap(x -> (x.2), arr)), session_arrays) AS max_session_times,
    arrayReduce('avg', arrayMap((x, y) -> (y - x), min_session_times, max_session_times)) AS avg
FROM 
(
    SELECT [('aa', toDateTime('2020-11-08 01:00:01')), ('aa', toDateTime('2020-11-08 01:00:03')), ('aa', toDateTime('2020-11-08 01:00:05')), ('ab', toDateTime('2020-11-09 01:00:00')), ('ab', toDateTime('2020-11-09 01:00:05')), ('ab', toDateTime('2020-11-09 01:00:15'))] AS data
)
/*
Row 1:
──────
sorted_array:      [('aa','2020-11-08 01:00:01'),('aa','2020-11-08 01:00:03'),('aa','2020-11-08 01:00:05'),('ab','2020-11-09 01:00:00'),('ab','2020-11-09 01:00:05'),('ab','2020-11-09 01:00:15')]
session_arrays:    [[('aa','2020-11-08 01:00:01'),('aa','2020-11-08 01:00:03'),('aa','2020-11-08 01:00:05')],[('ab','2020-11-09 01:00:00'),('ab','2020-11-09 01:00:05'),('ab','2020-11-09 01:00:15')]]
min_session_times: ['2020-11-08 01:00:01','2020-11-09 01:00:00']
max_session_times: ['2020-11-08 01:00:05','2020-11-09 01:00:15']
avg:               9.5
*/
vladimir
  • 13,428
  • 2
  • 44
  • 70
1

minMap(data.1, data.2)

select minMap(data.1, data.2) from (
SELECT [('aa', '2020-11-08 01:00:01'), 
        ('aa', '2020-11-08 01:00:03'), 
        ('aa', '2020-11-08 01:00:05'), 
        ('ab', '2020-11-09 01:00:00'), 
        ('ab', '2020-11-09 01:00:05'), 
        ('ab', '2020-11-09 01:00:15')] AS data)

┌─minMap(tupleElement(data, 1), tupleElement(data, 2))────────┐
│ (['aa','ab'],['2020-11-08 01:00:01','2020-11-09 01:00:00']) │
└─────────────────────────────────────────────────────────────┘

select minMap(data.1, data.2).2 as x, maxMap(data.1, data.2).2 as y ,
   arrayMap(i,j -> toDateTime(j)-toDateTime(i), x,y) r,
   arrayReduce('avg', r) z
from (
SELECT [('aa', '2020-11-08 01:00:01'), 
        ('aa', '2020-11-08 01:00:03'), 
        ('aa', '2020-11-08 01:00:05'), 
        ('ab', '2020-11-09 01:00:00'), 
        ('ab', '2020-11-09 01:00:05'), 
        ('ab', '2020-11-09 01:00:15')] AS data)
        
┌─x─────────────────────────────────────────────┬─y─────────────────────────────────────────────┬─r──────┬───z─┐
│ ['2020-11-08 01:00:01','2020-11-09 01:00:00'] │ ['2020-11-08 01:00:05','2020-11-09 01:00:15'] │ [4,15] │ 9.5 │
└───────────────────────────────────────────────┴───────────────────────────────────────────────┴────────┴─────┘
Denny Crane
  • 11,574
  • 2
  • 19
  • 30
  • Thank you for your answer. Clickhouse is awesome. But without great community no one would know about it. – nikopol Nov 11 '20 at 00:47