Getting a lot of "DB::NetException: Connection reset by peer, while reading from socket" errors that are creating a lot of noise

Question

I am running ClickHouse version '20.6.4' with default settings. While walking through the logs, I found these messages in abundance:

ServerErrorHandler: Code: 210, e.displayText() = DB::NetException: Connection reset by peer, while reading from socket

and

<Warning> ConnectionPoolWithFailover: Connection failed at try №1, reason: Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 172.16.*.*:9000 (172.16.*.*:9000) (version 20.6.3.28 (official build))","msg_id":"SERVER-1","namespace":"clickhouse.server","priority":6,"timestamp":"2020-09-21T00:01:23.623067Z","user_id":"","user_name":""}

I am using the go-clickhouse client with default settings (no change to any timeout), inserting data almost every minute (around 60-70k rows). There doesn't seem to be any impact, but I am getting a lot of these messages. These are my timeout-related settings:

 name                                             value       type
 
 connect_timeout                                  10          SettingSeconds             
 connect_timeout_with_failover_ms                 50          SettingMilliseconds        
 connect_timeout_with_failover_secure_ms          100         SettingMilliseconds        
 receive_timeout                                  300         SettingSeconds             
 send_timeout                                     300         SettingSeconds             
 tcp_keep_alive_timeout                           0           SettingSeconds             
 idle_connection_timeout                          3600        SettingUInt64              
 distributed_directory_monitor_sleep_time_ms      100         SettingMilliseconds        
 distributed_directory_monitor_max_sleep_time_ms  30000       SettingMilliseconds        
 insert_in_memory_parts_timeout                   600000      SettingMilliseconds        
 replication_alter_columns_timeout                60          SettingUInt64              
 insert_quorum_timeout                            600000      SettingMilliseconds        
 use_client_time_zone                             0           SettingBool                
 insert_distributed_timeout                       0           SettingUInt64              
 distributed_ddl_task_timeout                     180         SettingInt64               
 stream_poll_timeout_ms                           500         SettingMilliseconds        
 http_connection_timeout                          1           SettingSeconds             
 http_send_timeout                                1800        SettingSeconds             
 http_receive_timeout                             1800        SettingSeconds             
 query_profiler_real_time_period_ns               1000000000  SettingUInt64              
 query_profiler_cpu_time_period_ns                1000000000  SettingUInt64              
 max_execution_time                               0           SettingSeconds             
 timeout_overflow_mode                            throw       SettingOverflowMode        
 timeout_before_checking_execution_speed          10          SettingSeconds             
 temporary_live_view_timeout                      5           SettingSeconds             
 lock_acquire_timeout                             120         SettingSeconds             
 mark_cache_min_lifetime                          0           SettingUInt64              
 date_time_input_format                           basic       SettingDateTimeInputFormat 

Is there anything I can change to minimize these errors?

PS: I am using a native TCP connection for the inserts (using the 'kshvakov' client library). – Tomyhill, Sep 25 '20 at 12:51

Denny Crane · Answer 1 · 2020-09-25T14:49:56.047

1

It's two different issues.

Connection failed at try №1 connect_timeout_with_failover_ms 50

50 ms <-- check the latency (ping) between replicas. If the latency is greater than 1 ms, you need to increase connect_timeout_with_failover_ms.

cat /etc/clickhouse-server/conf.d/user_substitutes.xml

<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
            <connect_timeout_with_failover_ms>1000</connect_timeout_with_failover_ms>
        </default>
    </profiles>
</yandex>
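To pick a value for connect_timeout_with_failover_ms, it can help to measure how long a TCP connect to port 9000 actually takes from one replica to another, since that is closer to what the failing connection attempt does than an ICMP ping. The following is only a rough sketch using the Python standard library, not anything ClickHouse ships; the function names and the host name in the usage comment are made up for illustration.

```python
import socket
import time


def connect_time_ms(host, port, timeout_s=2.0):
    """Return the TCP connect time in milliseconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return None


def summarize(samples_ms, limit_ms):
    """Count failed probes and probes slower than limit_ms."""
    ok = [s for s in samples_ms if s is not None]
    return {
        "probes": len(samples_ms),
        "failed": len(samples_ms) - len(ok),
        "over_limit": sum(1 for s in ok if s > limit_ms),
        "max_ms": max(ok) if ok else None,
    }


# Usage (hypothetical host name, replace with a real replica address):
#   samples = [connect_time_ms("replica-1.example.internal", 9000) for _ in range(50)]
#   print(summarize(samples, limit_ms=50))
```

If a noticeable share of probes lands in "over_limit" for the current 50 ms setting, that would be consistent with the warnings in the log. Note that with packet loss a lost SYN is typically retransmitted only after about one second on a Linux TCP stack, so a connect can exceed a small timeout even when the ping round trip is only a few milliseconds.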

edited Sep 25 '20 at 14:49

answered Sep 25 '20 at 13:07

Denny Crane


If you are talking about the 'absolute_delay' column in the 'system.replicas' table, its value is 0. – Tomyhill Sep 25 '20 at 13:39
I am talking about network latency. You can check it with the ping command. – Denny Crane Sep 25 '20 at 13:59
Yes, the ping latency ranges from 3 to 20 ms for the other hosts in my cluster. Which value should I update, and by how much? – Tomyhill Sep 25 '20 at 14:03
I added a way to increase connect_timeout_with_failover_ms to the answer. – Denny Crane Sep 25 '20 at 14:50
Just one thing, @Denny: the ping is somewhere between 3 and 20 ms with 4% packet loss, but connect_timeout_with_failover_ms is set to 50 ms. Is that really the issue, given that the latency is less than 50 ms? And what should I increase this value to: 100, 200? – Tomyhill Sep 25 '20 at 14:59
Feel free to set 100 or 200, wait for errors, and increase it further if needed. – Denny Crane Sep 25 '20 at 16:03
I changed the timeout as you suggested and am no longer getting those errors, but I ended up getting new errors on port 9009. Link to the issue: https://stackoverflow.com/questions/64125607/getting-connect-timeout-on-9009-port-of-clickhouse – Tomyhill Sep 29 '20 at 18:34

score 1 · Answer 2 · answered Sep 25 '20 at 13:12

Connection reset by peer, while reading from socket

Which IP addresses are in that message? Are they CH servers or clients? It's a real error though. It just means that the client has gone away and the server did not get all the data it expected.
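The situation described above can be reproduced outside ClickHouse with two plain sockets on localhost. This is an illustrative sketch only, with made-up function names: the "client" sends part of a message and then aborts the connection, and the "server" side, which is still waiting for the rest, sees the reset while reading.

```python
import socket
import struct
import threading


def abort_after_partial_send(port, accepted):
    """Client side: send part of a message, then abort the connection."""
    sock = socket.create_connection(("127.0.0.1", port))
    accepted.wait(timeout=5)
    sock.sendall(b"partial")
    # SO_LINGER with a zero timeout makes close() send a RST instead of a FIN.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def read_exactly(conn, expected):
    """Server side: try to read `expected` bytes and report how the read ended."""
    received = b""
    try:
        while len(received) < expected:
            chunk = conn.recv(4096)
            if not chunk:
                return "closed early", received
            received += chunk
    except ConnectionResetError:
        return "reset by peer", received
    return "complete", received


def demo():
    """Run one client/server exchange on localhost and return the server's view."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    accepted = threading.Event()
    client = threading.Thread(
        target=abort_after_partial_send,
        args=(server.getsockname()[1], accepted),
    )
    client.start()
    conn, _ = server.accept()
    accepted.set()
    client.join()
    outcome = read_exactly(conn, expected=1024)
    conn.close()
    server.close()
    return outcome
```

On Linux, calling demo() is expected to report a reset on the reading side. An abrupt client exit, a killed process, or a network device dropping the connection can produce the same symptom on a ClickHouse server.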

Denny Crane


This is the ClickHouse server IP. – Tomyhill Sep 25 '20 at 13:37
I am also getting both "reading from socket" and "writing to socket" errors. – Tomyhill Sep 25 '20 at 14:01
Check what happened on that server at that moment, and why it stopped sending data. – Denny Crane Sep 25 '20 at 14:03
I am very new to ClickHouse. How can I check? Is there any system table where I can see that? – Tomyhill Sep 25 '20 at 14:05
Just check the log on the server with that IP at that time. – Denny Crane Sep 25 '20 at 14:51
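One possible way to follow that advice is to pull out only the log lines written close to the moment of the error. The sketch below assumes the JSON-wrapped log format shown in the question, where each line carries a "timestamp" field such as "2020-09-21T00:01:23.623067Z"; the function name and the log path in the usage comment are hypothetical, and a plain-text ClickHouse log would need a different pattern.

```python
import re
from datetime import datetime, timedelta

# Matches the "timestamp" field seen in the JSON-wrapped log line in the question.
TIMESTAMP_RE = re.compile(
    r'"timestamp":"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z"'
)


def lines_near(lines, moment, window_seconds=30):
    """Return the log lines whose timestamp is within window_seconds of moment."""
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if not match:
            continue  # Lines without a recognizable timestamp are skipped.
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if abs(stamp - moment) <= window:
            selected.append(line)
    return selected


# Usage (hypothetical path):
#   with open("/path/to/clickhouse-server.json.log") as f:
#       for line in lines_near(f, datetime(2020, 9, 21, 0, 1, 23)):
#           print(line, end="")
```

Running this on the server whose IP appears in the error, with the timestamp of a "Connection reset by peer" entry, should show what else was logged around that moment.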
