How do you determine reasonable health check timeouts for load balancers?

My application is failing load balancer health checks. I'm using the default health check timeout of 5 seconds, but I've noticed that the average latency graphs in CloudWatch jump to ~50s during periods (lasting 2-4 hours) when the application runs at its peak of ~30% CPU utilization. Memory utilization and IOPS are low and stable. Is 30% CPU utilization high enough to expect health check response times to exceed 5 seconds? If so, is there a standard practice for determining the health check timeout?

    I'd stop worrying about adjusting the timeouts and figure out why your site's taking 50+ seconds to respond for hours at a time. – ceejayoz Jun 25 '18 at 17:22

1 Answer

The answer to your question is necessarily vague. Asking 'How do I determine the correct health check timeout?' is much the same as asking 'What latency is still considered healthy for my application?'.

The general guide could be paraphrased as follows:

  1. Determine an acceptable latency for your application. In your case, we could assume that 50 seconds is still acceptable. I would consider that very abnormal, but since I don't know your application, I'll work with it.

  2. Set the timeout to something a bit beyond that at first, say 55 seconds.

  3. Load test your application with load similar to your production load, and see if it works for you.

  4. Make adjustments to your application and health check as necessary, repeat until you are satisfied with the results, and put it into production.

  5. Start over at 1.
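
Steps 1 and 2 can be sketched in code. This is only a minimal sketch, assuming you have already collected health check response times (in seconds) from a load test. The function names, the choice of the 99th percentile, and the 10% headroom (which matches the 50s to 55s example above) are illustrative assumptions, not a standard.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value that covers pct% of the samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeout(latencies_s, pct=99, headroom=0.10):
    """Suggest a timeout: the chosen latency percentile plus some headroom.

    pct and headroom are assumed defaults; tune them for your application.
    """
    base = percentile(latencies_s, pct)
    return round(base * (1 + headroom), 1)


# Hypothetical measurements from a load test, in seconds.
measured = [0.2, 0.3, 0.25, 4.0, 0.4]
print(suggest_timeout(measured))
```

If the suggested value is far above what you would call healthy, that is a sign to fix the application rather than raise the timeout.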

As for your second question, about CPU utilization: that depends on your application. Run tests, run load tests, find the bottleneck, and remove it.
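
A load test can start very small. The sketch below is an illustration under stated assumptions, not a replacement for a real load-testing tool: `fake_health_check` is a hypothetical stand-in for a request to your health check endpoint, and the request count and concurrency are arbitrary. It calls the check from several threads at once and records how long each call takes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_health_check():
    """Hypothetical stand-in for a real health check request."""
    time.sleep(0.01)


def timed_call(func):
    """Run func once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def measure_latencies(func, requests=20, concurrency=5):
    """Call func `requests` times across `concurrency` threads; return the latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, [func] * requests))


latencies = measure_latencies(fake_health_check)
print(f"max latency: {max(latencies):.3f}s over {len(latencies)} calls")
```

Raising the concurrency while watching how the latencies change is one simple way to see at what load the response time starts to climb.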

M. Glatki