Optimal primary key on this ClickHouse schema for aggregation

Question

I have a ClickHouse schema as following, MergeTree is in question:

(
hotel String,
staff_member String,
task_number Float64,
date DateTime
)
PRIMARY KEY (hotel, date)
ORDER BY (hotel, date)

My aggregation is as following:

SELECT
    staff_member,
    sum(task_number)
FROM ...
WHERE 
    hotel = {hotel}
AND date >= {first_date}
AND date <= {top_date}
GROUP BY staff_member

Basically, I'm aggregating the number of tasks of a staff member over a period of time, but the aggregation is kind of slow. I have a feeling the primary key is off and I need to rework it. First that comes to mind would be to change the key to (hotel, staff_member, date) since I'm grouping by the staff_member

I'm thankful for any help!

I would define more monotonic key: *(date, hotel)* (*date* come first). If these aggregates are required regularly then consider using precalculated aggregates - see [AggregatingMergeTree engine](https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree/). (look at https://stackoverflow.com/a/62464976/303298). — vladimir, Mar 13 '21 at 23:37
@vladimir Why date first, if you could tell me a bit more detail? I need to be able to change the date window on will, so would the AggregatingMergeTree still fit this criteria? — Alek Yo, Mar 14 '21 at 20:18
If date first then you data will be almost sorted before insertion. But actually `(hotel, date)` is also good if it's common to have `WHERE hotel = {hotel}` in queries. In documentation [there are](https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/#primary-keys-and-indexes-in-queries) similar example with `(CounterID, Date)`. — Deffe, Mar 16 '21 at 15:58

Optimal primary key on this ClickHouse schema for aggregation

0 Answers0