
The application’s production environment started throwing the following error:

ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://search-production-*.us-west-2.es.amazonaws.com:*, URI [/timerecord… [HTTP/1.1 503 Service Unavailable]. {
  "message": "No server available to handle the request",
}

No code that interfaces with Elasticsearch has been pushed to production recently, and there has been no significant increase in the amount of data flowing through Elasticsearch that would explain this. Nevertheless, the increase in JVM memory pressure is clear. Where should I look to investigate this issue further?

I’ve been reading the AWS documentation but am still unsure whether I should scale up or scale out.


1 Answer


Your problem seems to be related to a growing terms index. In memory, Lucene "maps prefixes of terms with the offset on disk where the block that contains terms that have this prefix starts".
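If you want to confirm that segment memory (including the terms index) is what is eating the heap, the node stats API will tell you. Here's a rough sketch in Python with requests — the endpoint is a placeholder for your own domain, authentication is left out, and the exact fields reported vary by Elasticsearch version:

    import requests

    # Placeholder endpoint -- substitute your own domain; authentication
    # (basic auth or SigV4 for AWS-managed domains) is omitted here.
    ES = "https://search-my-domain.us-west-2.es.amazonaws.com"

    resp = requests.get(f"{ES}/_nodes/stats/indices/segments", timeout=10)
    resp.raise_for_status()

    for node_id, node in resp.json()["nodes"].items():
        seg = node["indices"]["segments"]
        print(
            node.get("name", node_id),
            f"segments={seg['count']}",
            f"segment_memory={seg['memory_in_bytes'] / 2**20:.1f} MiB",
            # terms_memory_in_bytes is populated on older versions; newer
            # releases keep the terms index off-heap and may report 0 here
            f"terms_memory={seg.get('terms_memory_in_bytes', 0) / 2**20:.1f} MiB",
        )

If segment memory makes up a large share of each node's heap, that points straight at the mapping and index layout rather than at your query load.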

Even though newer versions of Elasticsearch try to use less memory for this, it is still something you have to pay close attention to.

I'm willing to bet that the high CPU usage is just the JVM constantly running garbage collection against an exhausted heap. AWS Elasticsearch instances allocate half of their memory to heap space.
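You can watch the heap itself the same way (same placeholder endpoint and auth caveats as above). If heap usage sits near the top of its range on every sample, the garbage collector is almost certainly running flat out:

    import requests

    # Same placeholder endpoint and auth caveats as the previous sketch.
    ES = "https://search-my-domain.us-west-2.es.amazonaws.com"

    resp = requests.get(f"{ES}/_nodes/stats/jvm", timeout=10)
    resp.raise_for_status()

    for node_id, node in resp.json()["nodes"].items():
        mem = node["jvm"]["mem"]
        heap_used = mem["heap_used_in_bytes"] / 2**30
        heap_max = mem["heap_max_in_bytes"] / 2**30
        print(
            node.get("name", node_id),
            f"heap {heap_used:.1f}/{heap_max:.1f} GiB "
            f"({mem['heap_used_percent']}%)",
        )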

Whether to scale up, scale out, or both depends a lot on your mapping. You'll get some quick relief by scaling up to an instance with more memory, but you'll have to take a deeper look at your mapping and queries to get the best long-term scalability.
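As a rough way to gauge how heavy the mapping is, you can pull it back and count the mapped fields. Sketch only — the index name is a placeholder, and the response shape assumes a 7.x-style mapping without custom mapping types:

    import requests

    # Placeholder endpoint and index name; auth omitted.
    ES = "https://search-my-domain.us-west-2.es.amazonaws.com"
    INDEX = "my-index"

    def count_fields(properties):
        """Recursively count mapped fields, including nested objects and multi-fields."""
        total = 0
        for field in properties.values():
            total += 1
            total += count_fields(field.get("properties", {}))
            total += len(field.get("fields", {}))  # e.g. keyword sub-fields
        return total

    resp = requests.get(f"{ES}/{INDEX}/_mapping", timeout=10)
    resp.raise_for_status()

    for index_name, body in resp.json().items():
        props = body["mappings"].get("properties", {})
        print(index_name, count_fields(props), "fields")

A mapping with thousands of fields, or lots of high-cardinality text fields, is usually a sign that restructuring the indices will buy you more than bigger instances will.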

It is entirely possible that the best solution is to both scale up and scale out. If you share your current instance types and the number of nodes you are running, I can edit this answer to give a more tailored recommendation for scaling in the short term.

Elasticsearch is very picky. The hardware it likes to run on, although always memory-intensive, varies a lot based on your mapping and the types of queries you throw at it. It is likely that, after you get it stable, you'll have to keep tweaking it to find the sweet spot for your performance, cost, and storage needs. Here is a good article about Elasticsearch Scalability and Resilience.

  • Thank you so much for the quick answer, really insightful! Elasticsearch runs on a t2.small instance and we are using two nodes. Right now we are trying to bring it back up, but to no avail, and AWS Support is not responding. – manu_dev May 20 '21 at 12:08
  • @manu_dev - do you know how your shards are set up? I'm guessing that the second t2.small is just for redundancy. Is that so? Are you using the AWS managed Elasticsearch (ES) service, or are you managing ES yourself? Also, I've never had success running ES on t2 instances: when the CPU credits run out, they just grind to a halt. I'd suggest going for a memory-optimized instance (maybe R4 or R5; R6g might work, but I've never tried ES on an ARM-based system). Those don't use CPU credits, so even if it has to do a lot of heap cleanup it should plow through, even if it slows things down. – Justin Waulters May 21 '21 at 02:43
  • Yup, the t2 instances were a big part of the issue. The AWS support team wasn't able to recover the shards after 3 days, so we just set up a clean instance and pointed it at our EC2. We were able to restore the indices from our DB, so no real data loss occurred. Thank you so much for your suggestions! – manu_dev May 31 '21 at 22:58
  • I'm glad you got it fixed. Is there a way we can edit this answer to make it the accepted solution? – Justin Waulters Aug 21 '21 at 00:54