
I ran into the scenario in the title on this cluster:

  • 5 shards, 5 replicas
  • Google Cloud Compute
  • Only one table on the cluster (sharded and replicated), using ReplicatedReplacingMergeTree. I can provide the exact schema if needed, but the issue should not depend on it; a rough sketch is included after this list.
  • ClickHouse 21.8.13.1.altinitystable (but it also reproduces on 20.7.2.30)
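
In case it helps, here is a minimal sketch of the table layout. This is not the real schema: the table, column, cluster, and partitioning choices below are placeholders I made up for illustration; only the engine family (a ReplicatedReplacingMergeTree behind a Distributed table) matches our setup.

-- Hypothetical local table, one replica per node (all names are placeholders)
CREATE TABLE errors_local ON CLUSTER 'errors_cluster'
(
    event_id       String,
    timestamp      DateTime,
    retention_days UInt16,
    deleted        UInt8
)
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/errors_local',
    '{replica}',
    deleted
)
PARTITION BY (retention_days, toMonday(timestamp))
ORDER BY event_id;

-- Hypothetical distributed table used for reads across the 5 shards
CREATE TABLE errors_dist ON CLUSTER 'errors_cluster' AS errors_local
ENGINE = Distributed('errors_cluster', 'default', 'errors_local', rand());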

This is the sequence of events:

  • I executed an OPTIMIZE TABLE .... PARTITION .... FINAL on one node of each shard. The partition is fairly large (120 GB), so the process would take longer than an hour.
  • The optimize started and was visible in system.merges and system.replication_queue as usual.
  • During the process one of the nodes was restarted because of a GCP maintenance event and came back up a few minutes later.
  • Once ClickHouse restarted, it restarted the merge as expected. However, three GET_PART operations (I assume parts created during the downtime that had to be replicated) did not start, as they were waiting on the large merge to complete. See the output of the replication_queue table below; 90-20220530_0_1210623_1731 is indeed the part produced by the merge generated by the OPTIMIZE statement.
SELECT
    replica_name,
    postpone_reason,
    type
FROM system.replication_queue

(formatted)

replica_name:    snuba-errors-tiger-4-4 
postpone_reason: Not executing log entry queue-0055035589 for part 90-20220530_0_1210420_1730 because it is covered by part 90-20220530_0_1210623_1731 that is currently executing.
type:            GET_PART

replica_name:    snuba-errors-tiger-4-4 
postpone_reason: Not executing log entry queue-0055035590 for part 90-20220530_1210421_1210598_37 because it is covered by part 90-20220530_0_1210623_1731 that is currently executing.
type:            GET_PART

replica_name:    snuba-errors-tiger-4-4
postpone_reason: Not executing log entry queue-0055035591 for part 90-20220530_1210599_1210623_6 because it is covered by part 90-20220530_0_1210623_1731 that is currently executing.
type:            GET_PART

replica_name:    snuba-errors-tiger-4-4
postpone_reason:
type:            MERGE_PARTS
  • The replication delay metrics increased to roughly 1 hour 30 minutes, and the distributed table did not send any queries to this node until the merge was done (90 minutes later); see the inspection queries sketched below.
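
For reference, queries along these lines can be used to watch the delay and the blocked entries (a sketch, not the literal commands I ran):

-- Replication delay and queue size on the affected node
SELECT database, table, absolute_delay, queue_size
FROM system.replicas;

-- Queue entries postponed behind the currently executing merge
SELECT type, new_part_name, postpone_reason
FROM system.replication_queue
WHERE postpone_reason != ''
FORMAT Vertical;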

Is this normal behavior? If so, is there a way to prevent a long merge from blocking replication after a restart? max_replica_delay_for_distributed_queries is set to 300 seconds on the cluster. I was expecting that delay to be ignored, but that did not seem to be the case, as no queries were routed to the impacted node. Is there another way to tell ClickHouse to ignore the replication delay?
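
For context, this is how I understand the relevant settings would be applied at query level; the table name is a placeholder, and whether fallback_to_stale_replicas_for_distributed_queries is applicable when only one replica is lagging is part of what I am asking rather than something I have verified.

SELECT count()
FROM errors_dist
SETTINGS
    max_replica_delay_for_distributed_queries = 300,        -- replicas lagging by more than this many seconds are skipped
    fallback_to_stale_replicas_for_distributed_queries = 1; -- if no replica meets the limit, query a lagging one anyway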

Thank you,
Filippo
