I have two (and possibly more) Kafka topics and I need to join them. From what I read on blogs/StackOverflow, there are two options:

1) Stream both of them (ClickHouse Kafka engine / Spark Streaming) into separate tables and then run a join, which is not recommended in ClickHouse?

2) Build one table with all the columns and use the ClickHouse Kafka engine / Spark Streaming to update the same entry?

Any advice?

Arnon Rodman

2 Answers

As always, it really depends on what kind of data you import and how you are going to use it, but I would say that in most cases it is better to import the two topics into a single table (so option 2). From there you will be able to quickly filter and aggregate the records. Depending on the queries you want to run, you should import the data with appropriate ORDER BY columns, which will make your queries much faster.
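
For illustration, a minimal sketch of option 2. All column names and types here are assumptions (they follow the A{x,y,z,ts}/B{w,x,v,ts} example the asker gives in the comments below), not something the question specifies:

CREATE TABLE combined_events
(
    x Int64,             -- key shared by both topics
    y Nullable(Int64),   -- columns only topic A provides
    z Nullable(Int64),
    w Nullable(Int64),   -- columns only topic B provides
    v Nullable(Int64),
    ts DateTime
)
ENGINE = MergeTree
ORDER BY (x, ts);        -- pick ORDER BY columns to match your most common filters

With a layout like this, a filter such as WHERE x = 42 reads only a small range of the table instead of joining two tables at query time.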

If you give more details about the schema of the data you want to join, I can be more specific with the answer.

valo
  • Thanks for the reply. If I have two topics A{x,y,z,ts} and B{w,x,v,ts}, and I create one table with columns x,y,z,w,v,A_ts,B_ts, does CH update the same record with the x value? Meaning, if we first get a message from topic A and 5-10 minutes later a message from topic B with the same value in the x field, does CH update the first insert? – Arnon Rodman Mar 10 '19 at 07:46
  • @ArnonRodman it is not possible to do that in CH. To merge events from different topics you need some preprocessing based on Kafka Streams or Spark, and then pass the merged events to CH. I would recommend asking the Kafka Streams/Spark community. – vladimir Mar 13 '19 at 22:20
  • @vladimirG Thanks for the reply. My idea was to implement the "upsert/update" manually: read from the two Kafka topics (Spark or the CH Kafka engine) and create a CH sink that issues an SQL update command. My CH table will have x,y,z,w,v... – Arnon Rodman Mar 15 '19 at 08:17
  • @ArnonRodman you can use a ReplacingMergeTree and specify `x` in the `ORDER BY`. There is also a `ver` parameter, which allows you to specify a version column, that will be used to keep the records with the largest version. Keep in mind that the duplicates are not removed immediately. If you run `OPTIMIZE TABLE ... FINAL` the duplicates will be removed, but it might take some time for that optimize to complete. – valo Apr 01 '19 at 16:15
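
For illustration, a minimal sketch of the ReplacingMergeTree approach from the last comment, reusing the A{x,y,z,ts}/B{w,x,v,ts} columns from this thread (types are assumptions):

CREATE TABLE merged_events
(
    x Int64,             -- the shared key; rows with equal ORDER BY values get deduplicated
    y Nullable(Int64),
    z Nullable(Int64),
    w Nullable(Int64),
    v Nullable(Int64),
    ts DateTime          -- used as the `ver` column: the row with the largest ts survives
)
ENGINE = ReplacingMergeTree(ts)
ORDER BY x;

-- Deduplication happens during background merges, not at insert time;
-- OPTIMIZE TABLE merged_events FINAL forces it but can take a while.

Note that ReplacingMergeTree keeps one whole row per key: it will not combine the non-NULL fields of the topic A row and the topic B row into a single row, which is why the earlier comment points to Kafka Streams/Spark for a real merge.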

The standard way to get data from Kafka into ClickHouse is to create a 'source' table with Engine=Kafka and a materialized view which copies the data to a final table with a MergeTree-family engine (e.g. ReplicatedMergeTree).

You can create multiple materialized views which write to the same target table, like this:


CREATE TABLE kafka_topic1 ( ... ) Engine=Kafka ...;

CREATE TABLE kafka_topic2 ( ... ) Engine=Kafka ...;

CREATE TABLE clickhouse_table ( ... ) Engine=MergeTree ...;

CREATE MATERIALIZED VIEW kafka_topic1_reader
  TO clickhouse_table
  AS SELECT * FROM kafka_topic1;

CREATE MATERIALIZED VIEW kafka_topic2_reader
  TO clickhouse_table
  AS SELECT * FROM kafka_topic2;
filimonov
  • Thanks for the reply. Will the final table clickhouse_table contain data from both Kafka engines? Will rows with the same key be merged, or do I need to create my own job that will join/merge them? – Arnon Rodman Mar 15 '19 at 08:32
  • By default a regular MergeTree just stores all inserted rows, no matter whether there are duplicates or not, and no matter how the data was inserted (by materialized view or directly). So you will have all rows from both topics (with duplicates if they appear). Deduplication is another (quite big) topic. – filimonov Mar 15 '19 at 13:10
  • CREATE TABLE kafka_topic1 (x, y, z, ...) Engine=Kafka ...; CREATE TABLE kafka_topic2 (x, w, f, ...) Engine=Kafka ...; CREATE TABLE clickhouse_table (x, y, z, w Nullable(Int64), f, ...) Engine=MergeTree ...; CREATE MATERIALIZED VIEW kafka_topic1_reader TO clickhouse_table AS SELECT * FROM kafka_topic1; CREATE MATERIALIZED VIEW kafka_topic2_reader TO clickhouse_table AS SELECT * FROM kafka_topic2; – Arnon Rodman Mar 20 '19 at 13:30
  • When I"m createing the first "CREATE MATERIALIZED VIEW kafka_topic1_reader" I"m getting an error on "w" filed "Expected column w: while pushing to view tmp1. clickhouse_table". Is it possible to create clickhouse_table with cloumns from both kafka topic and each diffrent "MATERIALIZED VIEW" will create raws with its topic cloumns – Arnon Rodman Mar 20 '19 at 13:31
  • That's quite weird (I mean putting data with different column sets into one table). But sure, you can. – filimonov Mar 20 '19 at 15:05
  • In your example I see two different Kafka topic engines with two MATERIALIZED VIEWs updating the same clickhouse_table. For that, clickhouse_table needs to contain all fields from both topics. Am I wrong? – Arnon Rodman Mar 21 '19 at 07:52
  • I thought that you just needed to combine data of the same nature coming from different sources. If it is like that and only a few columns differ, then of course your target table should have the columns from both topics (see the sketch after these comments). But if the sets of columns differ significantly (in your sample only one column is the same), maybe it would be better to store the data in 2 different tables (doing unions if necessary), instead of having half of the columns empty for half of the rows. – filimonov Mar 22 '19 at 08:07
  • Can you help with a ClickHouse import-from-CSV memory issue? https://stackoverflow.com/questions/56156682/clickhouse-import-data-from-csv-dbnetexception-connection-reset-by-peer-whil – Arnon Rodman May 21 '19 at 08:23
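
For illustration, a minimal sketch of the single-table variant discussed in these comments (column types are assumptions). Listing the columns explicitly in each materialized view, instead of using SELECT *, avoids the "Expected column w" error: columns a topic does not provide are simply left NULL. The table_topic1/table_topic2 names in the final query are hypothetical per-topic tables for the two-table alternative.

CREATE TABLE clickhouse_table
(
    x Int64,
    y Nullable(Int64),   -- only in topic 1
    z Nullable(Int64),   -- only in topic 1
    w Nullable(Int64),   -- only in topic 2
    f Nullable(Int64)    -- only in topic 2
)
ENGINE = MergeTree
ORDER BY x;

CREATE MATERIALIZED VIEW kafka_topic1_reader TO clickhouse_table
AS SELECT x, y, z FROM kafka_topic1;   -- w and f stay NULL for these rows

CREATE MATERIALIZED VIEW kafka_topic2_reader TO clickhouse_table
AS SELECT x, w, f FROM kafka_topic2;   -- y and z stay NULL for these rows

-- Two-table alternative from the last answer comment: one table per topic,
-- combined with UNION ALL at query time when a combined view is needed.
SELECT x, y, z, NULL AS w, NULL AS f FROM table_topic1
UNION ALL
SELECT x, NULL AS y, NULL AS z, w, f FROM table_topic2;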