I have an AWS-hosted Elasticsearch cluster whose nodes fail when heap usage reaches 75% and the (CMS) garbage collector runs.
The cluster runs ES version 7.9 with 3 dedicated master nodes (r5.large.elasticsearch) and 4 data nodes (r5.xlarge.elasticsearch).
That is: 4 vCPU / 32 GB per data node (16 GB heap), with 1 TB of SSD storage each, for a total of 4 TB of storage, and 2 vCPU / 16 GB per master node.
The cluster holds 33 indices with 1-3 primary shards each and 0-1 replicas (0 for the older ones). Shard size ranges from 50 MB to 60 GB, but most shards store about 30 GB. That is about 65 shards in total.
Whenever JVM memory pressure reaches 75% and the garbage collector (GC) runs, we start to get timeouts. The node running the GC goes down for a moment and then comes back up, which causes shard reallocation, more timeouts, and increased indexing and search latencies.
Checking the error logs we could see a lot of:
[WARN ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][2315905] overhead, spent [6.4s] collecting in the last [7.2s]
[WARN ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][2315905] overhead, spent [3.6s] collecting in the last [4.4s]
...
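To put a number on these warnings, the fraction of wall-clock time spent in GC can be read straight from each line. Below is a minimal sketch, assuming the log format shown above; the function name `gc_overhead_ratio` and the unit table are my own and not part of Elasticsearch.

```python
import re
from typing import Optional

# Matches the "overhead" warnings from JvmGcMonitorService shown above.
# Durations are assumed to be printed with an ms, s, or m suffix.
GC_OVERHEAD = re.compile(
    r"\[gc\]\[\d+\] overhead, spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s|m)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s|m)\]"
)

# Conversion factors to seconds (assumption: only these three units appear).
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def gc_overhead_ratio(line: str) -> Optional[float]:
    """Return time spent collecting divided by the reporting window, or None."""
    match = GC_OVERHEAD.search(line)
    if match is None:
        return None
    spent = float(match["spent"]) * UNIT_SECONDS[match["spent_unit"]]
    window = float(match["window"]) * UNIT_SECONDS[match["window_unit"]]
    return spent / window if window else None


if __name__ == "__main__":
    sample = (
        "[WARN ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][2315905] "
        "overhead, spent [6.4s] collecting in the last [7.2s]"
    )
    print(f"{gc_overhead_ratio(sample):.0%} of the window was spent in GC")
```

For the two lines above this works out to 6.4/7.2 and 3.6/4.4, so the node spends well over 80% of those windows collecting.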
During peak hours our indexing rate is about 4k operations/min and our search rate is 1k operations/min.
The GC runs about 3 times a day per data node (about 12 times a day across the cluster). The maximum heap usage among the 4 data nodes oscillates between 35% and 75% and never goes above 75%. When the GC is not running, CPU stays consistently at an average of 13%-15%, so we're highly confident that the instance size is appropriate for our current traffic.
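In case it helps anyone compare numbers, per-node heap usage and old-generation GC counts can be read from the node stats API. This is a minimal sketch, assuming the JSON response of `GET _nodes/stats/jvm` has already been fetched and parsed into a dict; the function name `summarize_jvm_stats` and the output keys are hypothetical.

```python
from typing import Any, Dict, List


def summarize_jvm_stats(stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize heap usage and old-generation GC activity per node.

    `stats` is the parsed body of `GET _nodes/stats/jvm`.
    """
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        jvm = node.get("jvm", {})
        old = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        count = old.get("collection_count", 0)
        millis = old.get("collection_time_in_millis", 0)
        rows.append({
            "node": node.get("name", node_id),
            "heap_used_percent": jvm.get("mem", {}).get("heap_used_percent"),
            "old_gc_count": count,
            # Average pause per old-generation collection, in milliseconds.
            "avg_old_gc_ms": millis / count if count else 0.0,
        })
    # Nodes under the most heap pressure first.
    return sorted(rows, key=lambda r: r["heap_used_percent"] or 0, reverse=True)
```

Sorting by heap usage puts the node closest to the 75% threshold at the top, and a high average old-generation pause on that node would line up with the timeouts described above.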
We followed some guides on how to avoid node crashes, but none of the usual causes seem to apply: we rarely aggregate on text fields, we run no complex aggregations, shards are evenly distributed, and the number of shards per index seems to be correct. We run very few wildcard queries, and those are manually triggered. All documents are small to medium sized (500-1000 characters).
So, any ideas on what could possibly be causing these crashes and long GC runs?
I found some related questions, such as this, but they have no answers.