Over the past few days, our ES 7.4 cluster (4 nodes) has regularly been giving read timeouts and is getting slower and slower at running certain management commands. Before that it had been running for more than a year without any trouble. For instance, /_cat/nodes took 2 minutes to execute yesterday; today it is already taking 4 minutes. Server loads are low and memory usage seems fine, so I'm not sure where to look further.

Using the opster.com online tool I got a hint that the management queue size is high. However, when executing the commands it suggests for investigation, I don't see anything out of the ordinary other than that the command itself takes a long time to return a result:

$ curl "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=id,active,rejected,completed,node_id"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   345  100   345    0     0      2      0  0:02:52  0:02:47  0:00:05    90
id                     active rejected completed node_id
JZHgYyCKRyiMESiaGlkITA      1        0   4424211 elastic7-1
jllZ8mmTRQmsh8Sxm8eDYg      1        0   4626296 elastic7-4
cI-cn4V3RP65qvE3ZR8MXQ      5        0   4666917 elastic7-2
TJJ_eHLIRk6qKq_qRWmd3w      1        0   4592766 elastic7-3

How can I debug this / solve this? Thanks in advance.

1 Answer


Notice that your elastic7-2 node has 5 active requests in the management thread pool, which is really high: the management pool is capped at 5 threads, and it is used only for a small number of operations (cluster management, not search/indexing). So on that node every management thread is occupied.
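To confirm the limit and see whether further requests are piling up, you can ask the same _cat API for the pool's queue and max columns as well (these are standard _cat/thread_pool headers; host/port as in the question):

$ curl "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=node_name,active,queue,rejected,completed,max"

If active is pinned at the max of 5 on one node while the other nodes sit at 0 or 1, the management threads on that node are most likely stuck on something rather than simply busy.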

You can have a look at thread pools in Elasticsearch for further reading.
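If this happens again, a good next step is the hot threads API for the affected node, to see what its management threads are actually doing (a sketch reusing the node name from your output; adjust host/port to your setup):

$ curl "http://127.0.0.1:9201/_nodes/elastic7-2/hot_threads?threads=10"

If nothing shows up as hot (as the comment below notes), the threads may be blocked waiting rather than burning CPU; adding type=wait to the hot_threads request can help spot that.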

– Amit
  • Thanks for the help! I managed to pinpoint that elastic7-2 was the problem. It had those 5 active management commands, but there seemed to be no hot threads. Then I started calling /_nodes/elastic7-1, -2, etc. (a sketch of that per-node check follows below) and noticed that calling elastic7-2 would take forever, even when called from elastic7-2 itself. So I decided to restart that one, and now things seem to be fast and green again. <3 – Lourens Rozema Nov 27 '20 at 08:53
  • Do you have an idea how elastic7-2 could have come into this state? In the past two weeks I have seen disconnects and master re-elections while I could not identify any network connectivity issues. Would that be a result of the full queue or did the queue fill up because of networking issues? – Lourens Rozema Nov 27 '20 at 09:00
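For reference, a minimal sketch of the per-node check described in the first comment, assuming the same host/port as in the question (curl's %{time_total} write-out reports how long each request took):

$ for node in elastic7-1 elastic7-2 elastic7-3 elastic7-4; do
>   printf '%s: ' "$node"
>   curl -s -o /dev/null -w '%{time_total}s\n' "http://127.0.0.1:9201/_nodes/$node"
> done

A node whose response time is wildly out of line with its peers, even when queried locally, is a strong candidate for a restart, as the comment above found.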