
We have run into a problem in our project that we cannot solve. We have a huge amount of log data, and we are moving from MongoDB to ClickHouse.

Our table is created like this:

CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
    raw         String,
    ts          DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
    device_id   String        MATERIALIZED JSONExtractString(raw, 'device_id'),
    level       Int8          MATERIALIZED JSONExtractInt(raw, 'level'),
    context     String        MATERIALIZED JSONExtractString(raw, 'context'),
    event       String        MATERIALIZED JSONExtractString(raw, 'event'),
    event_code  String        MATERIALIZED JSONExtractInt(raw, 'event_code'),
    data        String        MATERIALIZED JSONExtractRaw(raw, 'data'),
    date        Date          DEFAULT toDate(ts),
    week        Date          DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week
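
With this schema only raw needs to be inserted; the MATERIALIZED and DEFAULT columns are derived from it on write. A hypothetical example row (field names taken from the JSONExtract calls above):

-- Hypothetical row; the JSON shape is inferred from the table definition.
INSERT INTO logs (raw) VALUES
('{"date_time":"2021-07-05 09:00:00.000000","device_id":"some_uuid","level":4,"context":"app","event":"boot","event_code":7,"data":{"x":1}}');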

and I'm running a query like this:

SELECT device_id, toDateTime(ts), context, level, event, data
FROM logs 
WHERE device_id = 'some_uuid'
ORDER BY ts DESC 
LIMIT 10 
OFFSET 0;

This is the result: 10 rows in set. Elapsed: 6.23 sec.

And a second query, without ORDER BY, LIMIT, and OFFSET:

SELECT device_id, toDateTime(ts), context, level, event, data
FROM logs 
WHERE device_id = 'some_uuid'

This is the result: Elapsed: 7.994 sec. for each batch of 500 rows out of 130,000+.

This is too slow.

It seems that ClickHouse processes all the rows in the table. What is wrong, and what do we need to change to improve the speed?

The same implementation on MongoDB takes 200-500 ms at most.

  • Maybe you can try to select raw first with the WHERE clause, and then call JSONExtractString for the params you need – Emanuele Jul 05 '21 at 09:55
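
A rough sketch of what that comment seems to suggest, assuming the same logs table (alias names are illustrative):

SELECT
    JSONExtractString(raw, 'context') AS context,
    JSONExtractInt(raw, 'level')      AS level,
    JSONExtractString(raw, 'event')   AS event,
    JSONExtractRaw(raw, 'data')       AS data
FROM
(
    -- Filter down to the matching rows first, carrying only raw,
    -- then extract the JSON fields in the outer query.
    SELECT raw
    FROM logs
    WHERE device_id = 'some_uuid'
);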

1 Answer

Egor! When you mentioned "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or do you somehow connect to ClickHouse from MongoDB to run the queries you're referring to?

I'm not sure how you ingest your data, but let's focus on the reading part.

For the MergeTree family, ClickHouse writes data in parts. Therefore, it is vital to have a timestamp as part of your WHERE clause, so ClickHouse can determine which parts it needs to read and skip the data you don't need. Otherwise, it will scan all the data.

I would imagine these queries would do the scan faster:

SELECT device_id, toDateTime(ts), context, level, event, data
FROM logs 
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC 
LIMIT 10 
OFFSET 0;

SELECT device_id, toDateTime(ts), context, level, event, data
FROM logs 
WHERE device_id = 'some_uuid' AND week = '2021-07-05';

AFAIK, unless you specify the exact partition format, ClickHouse will use partitioning by month (i.e. toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at the system.parts table:

SELECT
    partition,
    name,
    active
FROM system.parts
WHERE table = 'logs'

So, if you want to store data in weekly parts, I would imagine the partitioning could look like this:

...
ORDER BY (device_id, ts)
PARTITION BY toMonday(week)
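
Since the partition key of an existing MergeTree table can't be changed in place, applying this means creating a new table and backfilling it. A rough sketch, reusing the original column definitions under a hypothetical name logs_weekly:

CREATE TABLE IF NOT EXISTS logs_weekly ON CLUSTER default (
    raw         String,
    ts          DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
    device_id   String        MATERIALIZED JSONExtractString(raw, 'device_id'),
    level       Int8          MATERIALIZED JSONExtractInt(raw, 'level'),
    context     String        MATERIALIZED JSONExtractString(raw, 'context'),
    event       String        MATERIALIZED JSONExtractString(raw, 'event'),
    event_code  String        MATERIALIZED JSONExtractInt(raw, 'event_code'),
    data        String        MATERIALIZED JSONExtractRaw(raw, 'data'),
    date        Date          DEFAULT toDate(ts),
    week        Date          DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY toMonday(week);

-- Backfill: inserting only raw is enough; the MATERIALIZED and
-- DEFAULT columns are recomputed on insert.
INSERT INTO logs_weekly (raw) SELECT raw FROM logs;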

This is also a good piece of information: Using Partitions and Primary keys in queries

Victor Perov
  • Cool, it works)) Thx!) This will probably solve our problem. – Egor Kurito Jul 05 '21 at 14:11
  • Some queries became faster (200-500 ms), but some stayed slower (1-2 s). Does it make sense to make the partitions larger or smaller? – Egor Kurito Jul 05 '21 at 14:29
  • To answer why some of the queries take 1-2 s, you need to see how many parts ClickHouse tries to read to satisfy the query. You can try the `EXPLAIN` syntax (added in recent versions), or `--send_logs_level=trace` when you connect with your `clickhouse-client`. But yes, mostly it's because ClickHouse has to touch more parts on disk. – Victor Perov Jul 07 '21 at 13:55
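
For example, a sketch of the EXPLAIN approach (the indexes = 1 setting is only available in newer ClickHouse versions):

-- Shows which partitions/parts/granules the partition key and
-- primary key select for this query.
EXPLAIN indexes = 1
SELECT device_id, toDateTime(ts), context, level, event, data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05';

Alternatively, running the same query through clickhouse-client --send_logs_level=trace prints trace lines like "Selected N parts by partition key, M parts by primary key", which show how much pruning actually happened.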