
I have a 6-node cluster running ES 5.4 with 4B small documents already indexed.
The documents are organized into ~9K indices, for a total of 2TB. Index sizes vary from a few KB to hundreds of GB, and they are sharded so as to keep each shard under 20GB.

Cluster health query responds with:

{
    "cluster_name": "##########",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 6,
    "number_of_data_nodes": 6,
    "active_primary_shards": 9014,
    "active_shards": 9034,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100
}

Before any search traffic is sent to the cluster it is stable, and it handles a bulk index request every second with anywhere from 10 to a few thousand documents without any problem.

Everything is fine until I redirect some traffic to this cluster. As soon as it starts to serve responses, the majority of the servers start reading from disk at 250 MB/s, making the cluster unresponsive.

What is strange is that I cloned this ES configuration on AWS (same hardware, same Linux kernel, but a different Linux version) and there I have no problem. NB: 40MB/s of disk read is what I have always seen on servers that are serving traffic.

Relevant Elasticsearch 5 configurations are:

  • -Xms12g -Xmx12g in jvm.options

I also tested it with the following configurations, but without success (a sketch of where these settings live follows the list):

  • bootstrap.memory_lock:true
  • MAX_OPEN_FILES=1000000
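
For reference, a minimal sketch of where these settings live, assuming a standard Debian package install of Elasticsearch 5.x (paths may differ; on systemd hosts the file-descriptor and memory-lock limits may instead need to go into a systemd override):

# /etc/elasticsearch/jvm.options
-Xms12g
-Xmx12g

# /etc/elasticsearch/elasticsearch.yml
bootstrap.memory_lock: true

# /etc/default/elasticsearch
MAX_OPEN_FILES=1000000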

Each server has 16 CPUs and 32GB of RAM; some run Debian Jessie 8.7, others Jessie 8.6; all have kernel 3.16.0-4-amd64.

I checked the query cache on each node with localhost:9200/_nodes/stats/indices/query_cache?pretty&human and all the servers have similar statistics: cache size, hits, misses and evictions.
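
For reference, the check is just a plain curl of the node stats endpoint on each server; the fields compared were memory_size, hit_count, miss_count and evictions:

curl -s 'http://localhost:9200/_nodes/stats/indices/query_cache?pretty&human'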

It doesn't seem to be a warm-up operation, since I never see this behavior on the cloned AWS cluster, and also because it never ends.
I can't find any useful information under /var/log/elasticsearch/*.

Am I doing anything wrong?
What should I change in order to solve this problem?

Thanks!

Luca Mastrostefano
  • May I ask you to clarify something? So you have 6 servers with Linux, and compared with the same cluster on AWS, performance on AWS is OK. What are the disks on your servers? Are they spinning or SSD? AWS usually uses SSD over the network, which may make a difference. Also, the number of primary shards looks suspicious; check out this section of the ES guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/kagillion-shards.html Are these shards well spread over the cluster? Thank you. – Nikolay Vasiliev Aug 11 '17 at 14:07
  • All the disks are SSDs. Regarding the shards, I have about 9000 indices and only 10 of them are sharded (anyway, max 16 shards per index). The shards are well balanced across the cluster. I had this configuration working on ES 2.4 (same shards but fewer documents per index). – Luca Mastrostefano Aug 11 '17 at 15:06
  • Thanks, so you have this problem appearing while migrating from ES 2.4 to 5.4? Could you provide part of the mapping in both ES 2 and 5 (I understand there are 9k fields, so it's not possible to display it all here)? – Nikolay Vasiliev Aug 12 '17 at 07:05
  • I would try _nodes/hot_threads to see what ES is doing while it gets `stuck` (see the sketch below). – Nirmal Nov 13 '19 at 21:24
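
A minimal sketch of the hot threads check suggested in the last comment; threads and interval are optional parameters of the standard _nodes/hot_threads API, and it should be run while the heavy disk reads are happening:

curl -s 'http://localhost:9200/_nodes/hot_threads?threads=10&interval=500ms'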

3 Answers


You probably need to reduce the number of threads for searching. Try going with 2x the number of processors, in elasticsearch.yml (note that in 5.x the setting is spelled thread_pool.search.size; the older 2.x form threadpool.search.size is no longer accepted):

thread_pool.search.size: <size>

Also, that sounds like too many shards for a 6 node cluster. If possible, I would try reducing that.
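
A hedged sketch of how this could look: first check whether the search pool is actually queueing or rejecting work, then set the pool size in elasticsearch.yml. The value 32 is just the 2x-CPUs suggestion above worked out for the question's 16-core nodes, not a verified recommendation:

# check current search thread pool activity per node
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,size,active,queue,rejected'

# elasticsearch.yml (static setting, requires a node restart)
thread_pool.search.size: 32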

Brandon Kearby

Regarding "servers start reading from disk at 250 MB/s making the cluster unresponsive": this may be related to the maximum content of an HTTP request, http.max_content_length. It defaults to 100mb, and if set to a value greater than Integer.MAX_VALUE it is reset to 100mb.

The cluster can become unresponsive because of this, and you might see related messages in the logs. Check the maximum read size of the requests you send against the indices.

See the Elasticsearch HTTP settings for details.
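
A minimal sketch of where this setting lives; 100mb is already the default, so it only needs changing if you genuinely send larger request bodies:

# elasticsearch.yml
http.max_content_length: 100mb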

Jinna Balu

A few things:

  1. 5.x has been EOL for years now, please upgrade as a matter of urgency
  2. You are heavily oversharded

For point 2, you either need to:

  1. Upgrade to handle that number of shards; the memory management in 7.X is far superior
  2. Reduce your shard count by reindexing (see the sketch after this list)
  3. Add more nodes to deal with the load
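
A hedged sketch of the reindex approach (option 2): consolidate several small indices into one larger one. The index names here are hypothetical, and the target's shard count is only an example; size it so shards stay in the tens-of-GB range:

# 1) create the target index with an appropriate shard count
curl -XPUT 'http://localhost:9200/consolidated-index' -H 'Content-Type: application/json' -d '{
  "settings": { "index.number_of_shards": 4, "index.number_of_replicas": 1 }
}'

# 2) copy the small indices into it with the _reindex API (source accepts a list of indices)
curl -XPOST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
  "source": { "index": ["small-index-1", "small-index-2"] },
  "dest":   { "index": "consolidated-index" }
}'

# 3) once the copy is verified, delete the old indices
curl -XDELETE 'http://localhost:9200/small-index-1,small-index-2'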
warkolm