
I am evaluating ClickHouse for storing streaming events with a variable event schema. The event schema is nested JSON, and new properties can be of different data types, so we cannot normalize them into a single key-value pair structure.

Based on the documentation, it looks like whenever the schema changes we have to explicitly alter the table schema and then insert the new record. Since the inserts happen from multiple source workers, adding the 'IF NOT EXISTS' clause can help minimize race conditions.
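
For context, this is roughly what each worker would have to run today before inserting (table and column names here are just placeholders):

ALTER TABLE events ADD COLUMN IF NOT EXISTS screen_name String;
INSERT INTO events (event_id, event_time, screen_name) VALUES (1, now(), 'checkout');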

Can anyone advise if there is a better way to keep inserting events with dynamically changing schemas into ClickHouse? Is there any way to have ClickHouse infer the updated schema from the event being inserted?

KVK

2 Answers


Each column requires about 2 MB of RAM on insert.

A table with 1000 columns will eat 2 GB per insert.

Each column is roughly 3 files on disk, so each column adds about 3 random reads/writes per insert.

ALTER TABLE can be blocked by merges/selects and can take hours to execute.

I would pick out the MAIN columns (those used most often in WHERE) from the JSON and put them into dedicated columns. Put the rest into K/V: two columns (Key Array(String), Value Array(String)). Otherwise use MongoDB.
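
A rough sketch of what that could look like (column names are only illustrative):

CREATE TABLE events
(
    event_time DateTime,
    event_type LowCardinality(String),
    user_id UInt64,           -- MAIN columns promoted from the JSON
    attr_key Array(String),   -- everything else as key/value arrays
    attr_value Array(String)
)
ENGINE = MergeTree()
ORDER BY (event_type, event_time);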

Denny Crane

What about using a fixed schema in ClickHouse that can support any type in your events, for example using Nested columns or JSON columns? I tried this for a PoC using Nested columns with good results. First I identified 4 different data types: strings, numbers, booleans and dates.

The ClickHouse schema would look like this

CREATE TABLE custom_events
(
    -- ... Normal columns needed depending on your use case
    attribute_string Nested
    (
        name String,
        value String
    ),
    attribute_number Nested
    (
        name String,
        value Float32
    ),
    attribute_boolean Nested
    (
        name String,
        value UInt8
    ),
    attribute_date Nested
    (
        name String,
        value DateTime('UTC')
    )
)
ENGINE = MergeTree()
ORDER BY (
   ...
);

The Nested columns will be converted into two columns of Array type by ClickHouse. To use this schema you need to group the event fields by type and flatten them before inserting into ClickHouse. For example, the following event in JSON

{
    "event_id": 1,
    "event_date": "2018-08-02 10:06:23",
    "data": {
        "rides": [
            {
                "km": 93,
                "vehicle": "plane",
                "people": 2.15,
                "finished": true,
                "date": "2019-06-24T18:52:34"
            }
        ]
    }
}

Could be flattened like this

{
    "event_id": 1,
    "event_date": "2018-08-02 10:06:23",
    "data.rides.0.km": 93,
    "data.rides.0.vehicle": "plane",
    "data.rides.0.people": 2.15,
    "data.rides.0.finished": true,
    "data.rides.0.date": "2019-06-24T18:52:34"
}

And then you can insert into ClickHouse like so

INSERT INTO custom_events VALUES (1, '2018-08-02 10:06:23',
['data.rides.0.vehicle'], ['plane'], -- This is Key and Value for strings
['data.rides.0.km', 'data.rides.0.people'], [93, 2.15], -- This is Key and Value for numbers
... -- do the same for each data type
)

Then you can query the data leveraging all the great higher-order functions provided by ClickHouse

SELECT
    arrayFilter(x -> ((x.1) = 1), arrayMap((x, y) -> (x LIKE 'data.rides.%.km', y), attribute_number.name, attribute_number.value)).2 AS element,
    event_date
FROM custom_events
WHERE (event_id = 1) AND notEmpty(element)

The query doesn't make much sense as I redacted the fields a bit, but you get the idea. This way you don't need to modify your schema, and you can store any arbitrary JSON in the same fixed schema in ClickHouse.
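
If you just need to pull out a single attribute by name, a shorter pattern (only a sketch, reusing the custom_events table above) is to locate the key in the name array with indexOf and read the matching element with arrayElement:

SELECT
    event_date,
    arrayElement(attribute_number.value, indexOf(attribute_number.name, 'data.rides.0.km')) AS km
FROM custom_events
WHERE event_id = 1

indexOf returns 0 when the key is not present, and arrayElement with index 0 returns the type's default value (0 for Float32), so missing attributes simply come back as 0.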

lloiacono
  • Hi, is this method efficient for queries involving these attributes? For example, if I have (a) a fixed schema with, say, 50 columns, and (b) the schema in your answer, is the query performance the same? Thanks! – ch271828n Jan 03 '21 at 09:16
  • In addition, how could I query those dynamic columns as if they were not nested? IMHO I will face difficulties if I want to do things like `WHERE data.rides.*.km > 1` or `GROUP BY ...` – ch271828n Jan 03 '21 at 09:17
  • 1
    Not sure I understand what you want to do with 50 columns, the schema I proposed is based on data types, so I don't see how you could have 50 columns. Regardless you should give it a try and test. From my own test the performance is quite good and depending on what you are trying to do and which queries you need there are more things you could do like adding row skipping indexes or creating a MV – lloiacono Jan 06 '21 at 07:40