I'm trying to load a *.gz file into ClickHouse through:

clickhouse-client --max_memory_usage=15323460608 --format_csv_delimiter="|" --query="INSERT INTO tmp1.my_test_table FORMAT CSV"
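Roughly, the full pipeline looks like this (decompressing the export with zcat and piping the CSV into the client on stdin):

zcat one_hour_business.gz | clickhouse-client --max_memory_usage=15323460608 --format_csv_delimiter="|" --query="INSERT INTO tmp1.my_test_table FORMAT CSV"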

I"m getting the error: Code: 210. DB::NetException: Connection reset by peer, while writing to socket (127.0.0.1:9000) .

No errors in clickhouse-server.log, clickhouse-server.err.log or zookeeper.log.

When I run the insert command I can see memory usage approaching the server's limit (32 GB), which is why I tried to cap it with max_memory_usage, but I get the same error.

Any ideas? Thanks in advance

Arnon Rodman
  • That is very weird; most probably you have something misconfigured. Maybe you have some extra-long strings in your source data (like megabytes of data per value)? Does the insert work if you insert a small number of rows, like a few hundred / thousand? – filimonov May 25 '19 at 11:09
  • Hi and thanks for the reply. Yes, when I try to insert a small amount it works (head -5000 of the 35M rows works fine) – Arnon Rodman May 26 '19 at 07:36
  • Can you provide a minimal reproducible sample? https://stackoverflow.com/help/minimal-reproducible-example – filimonov May 26 '19 at 20:00
  • CREATE TABLE tmp1.shit_happens ( x Float32, y Float32, z Float64, extra_data_x String, guid String, ipAddress String, business_ts DateTime, user_id Int64, insert_ts DateTime) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test_cluster_two_shards/shit_happens', '10.146.148.83') PARTITION BY toYYYYMMDD(business_ts) PRIMARY KEY (toYYYYMMDD(buissnes_ts), guid) ORDER BY (toYYYYMMDD(spin_ts), guid) SETTINGS index_granularity = 8192 – Arnon Rodman May 27 '19 at 06:44
  • 4 ClickHouse servers, 3 ZooKeeper nodes; I configured test_cluster_two_shards, user is "default". One file exported from Vertica DB: 35M records, one_hour_business.gz, size 1.8 GB. This one works: head -500000 one_hour_business | clickhouse-client --receive_timeout=3600 --send_timeout=3600 --connect_timeout=3600 --format_csv_delimiter="|" --query="INSERT INTO tmp1.shit_happens FORMAT CSV" The error happens with: clickhouse-client --format_csv_delimiter="|" --query="INSERT INTO tmp1.shit_happen FORMAT CSV" – Arnon Rodman May 27 '19 at 06:51
  • Does it reproduce without the custom timeout settings? Do you have any other non-standard settings on the client or the server? What ClickHouse version are you using? BTW 1) it's not worth adding the partition key at the beginning of the ORDER BY key. 2) your sample is not reproducible: I get "DB::Exception: Primary key must be a prefix of the sorting key" when trying your CREATE statement. – filimonov May 27 '19 at 15:06

2 Answers


As previously mentioned, it was a memory problem. Adding --max_memory_usage=10000000000 did the trick for me.
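For example, passed on the command line next to the other options (the file and table names here are just placeholders):

zcat data.gz | clickhouse-client --max_memory_usage=10000000000 --format_csv_delimiter="|" --query="INSERT INTO tmp1.my_test_table FORMAT CSV"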

Pedro Muñoz

The issue might be that you are partitioning your data on days and that your bulk-insert CSV has too many days in it. Try removing the PARTITION BY toYYYYMMDD(business_ts) specification in your table creation. I noticed a similar issue when inserting into one of my tables. Prior to adding a --max_memory_usage argument, I was getting exactly the same error you are reporting here: Code: 210. DB::NetException: Connection reset by peer, while writing to socket (127.0.0.1:9000).

I then added --max_memory_usage=15000000000 and I received a more helpful error message:

Received exception from server (version 20.11.5): Code: 252. DB::Exception: Received from localhost:9000. DB::Exception: Too many partitions for single INSERT block (more than 100). The limit is controlled by 'max_partitions_per_insert_block' setting. Large number of partitions is a common misconception. It will lead to severe negative performance impact, including slow server startup, slow INSERT queries and slow SELECT queries. Recommended total number of partitions for a table is under 1000..10000. Please note, that partitioning is not intended to speed up SELECT queries (ORDER BY key is sufficient to make range queries fast). Partitions are intended for data manipulation (DROP PARTITION, etc)..

As the more helpful error message points out, PARTITION is not there to improve SELECT performance; it's there to make non-query data manipulations (DROP PARTITION and the like) more efficient. I don't know all the details of your use case here, but it may make sense to ORDER BY both spin_ts and business_ts and drop the PARTITION on business_ts.
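A rough sketch of what that could look like, based on the CREATE TABLE posted in the comments above (spin_ts is not among the declared columns there, so this version orders on business_ts and guid only; adjust the key to the columns you actually have):

CREATE TABLE tmp1.shit_happens
(
    x Float32,
    y Float32,
    z Float64,
    extra_data_x String,
    guid String,
    ipAddress String,
    business_ts DateTime,
    user_id Int64,
    insert_ts DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/test_cluster_two_shards/shit_happens', '10.146.148.83')
ORDER BY (business_ts, guid)
SETTINGS index_granularity = 8192

Alternatively, if the daily partitioning has to stay, the max_partitions_per_insert_block setting named in the error text can be raised for the inserting session, the same way --max_memory_usage is passed to clickhouse-client above.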

geeves