Insert data into a ClickHouse table from a file with deduplication

Question

I'm importing data into a ClickHouse table from CSV files.

cat data.csv | clickhouse-client --config-file=config.xml --query="INSERT INTO data_pool FORMAT CSVWithNames"

Often CSV files contain duplicate entries that are already in the ClickHouse table. What is the most efficient way to insert new data from a CSV file, skipping the entries already in the table?

Slach · Answer 1 · 2023-02-05T14:33:54.180

0

Something like that

cat data.csv | clickhouse-client --echo --mn --config-file=config.xml --query="CREATE TABLE tmp_data_pool LIKE data_pool ENGINE=ReplcaingMergeTree() ORDER BY f1,f2,f3...fieldN; INSERT INTO tmp_data_pool FORMAT CSVWithNames; INSERT INTO data_pool SELECT * FROM tmp_data_pool FINAL; DROP TABLE tmp_data_pool SYNC"

look details about FINAL and ReplcacingMergeTree() https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree/

https://clickhouse.com/docs/en/operations/settings/settings#max-final-threads and https://github.com/ClickHouse/ClickHouse/pull/19375

to improve FINAL performance optimizations

edited Feb 05 '23 at 14:33

answered Feb 05 '23 at 14:22

Slach

1,672
13
21

Thank you for the answer, but it does not actually solve my problem. The main issue is to skip the duplicates from CSV that are already in the data_pool table. The table data_pool can have the following: ``` col1 col2 col3 2 city2 value2 3 city3 value3 ``` and the CSV file can have the following: ``` col1 col2 col3 0 city0 value0 2 city2 value2 4 city4 value ``` So I want to import from the CSV only two entries that are not in the data_pool: ``` 0 city0 value0 4 city4 value ``` – Greg Feb 06 '23 at 11:41
ok. after create TEMPORARY TABLE you can try do following query `INSERT INTO table SELECT t1.* FROM tmp_table AS t1 LEFT JOIN table AS t2 ON t1.col1=t1.col1 AND ... t1.colN = t2.colN WHERE t2.col1 IS NULL SETTINGS join_use_nulls=1, join_algorithm='auto'` – Slach Feb 07 '23 at 07:51

Insert data into a ClickHouse table from a file with deduplication

1 Answers1