0

I'm importing data into a ClickHouse table from CSV files.

cat data.csv | clickhouse-client --config-file=config.xml --query="INSERT INTO data_pool FORMAT CSVWithNames"

Often CSV files contain duplicate entries that are already in the ClickHouse table. What is the most efficient way to insert new data from a CSV file, skipping the entries already in the table?

Greg
  • 137
  • 2
  • 11

1 Answers1

0

Something like that

cat data.csv | clickhouse-client --echo --mn --config-file=config.xml --query="CREATE TABLE tmp_data_pool LIKE data_pool ENGINE=ReplcaingMergeTree() ORDER BY f1,f2,f3...fieldN; INSERT INTO tmp_data_pool FORMAT CSVWithNames; INSERT INTO data_pool SELECT * FROM tmp_data_pool FINAL; DROP TABLE tmp_data_pool SYNC"

look details about FINAL and ReplcacingMergeTree() https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree/

https://clickhouse.com/docs/en/operations/settings/settings#max-final-threads and https://github.com/ClickHouse/ClickHouse/pull/19375

to improve FINAL performance optimizations

Slach
  • 1,672
  • 13
  • 21
  • Thank you for the answer, but it does not actually solve my problem. The main issue is to skip the duplicates from CSV that are already in the data_pool table. The table data_pool can have the following: ``` col1 col2 col3 2 city2 value2 3 city3 value3 ``` and the CSV file can have the following: ``` col1 col2 col3 0 city0 value0 2 city2 value2 4 city4 value ``` So I want to import from the CSV only two entries that are not in the data_pool: ``` 0 city0 value0 4 city4 value ``` – Greg Feb 06 '23 at 11:41
  • ok. after create TEMPORARY TABLE you can try do following query `INSERT INTO table SELECT t1.* FROM tmp_table AS t1 LEFT JOIN table AS t2 ON t1.col1=t1.col1 AND ... t1.colN = t2.colN WHERE t2.col1 IS NULL SETTINGS join_use_nulls=1, join_algorithm='auto'` – Slach Feb 07 '23 at 07:51