I have multiple Node.js web servers hosted on EC2 behind a load balancer, and some users are getting a 502 before the request even reaches the server.

Those requests do not appear in the server logs, which is why I am assuming they never reach the server.

I had a similar problem before, and I fixed it by setting `keepAliveTimeout` and `headersTimeout` in the Node.js server configuration.
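For readers unfamiliar with that earlier fix, here is a minimal sketch of what setting those two properties can look like. It assumes the load balancer idle timeout is 60 seconds (the ALB default; check your own setting), and the helper name `createTunedServer`, the 5 s and 1 s margins, and the port are illustrative choices, not values from the question. The idea is that Node's default `keepAliveTimeout` of 5 s is shorter than the balancer's idle timeout, so the balancer may reuse a connection that Node has already closed and report a 502.

```javascript
const http = require('node:http');

// Assumption: load balancer idle timeout in milliseconds (60 s is the ALB default).
const LB_IDLE_TIMEOUT_MS = 60000;

// Hypothetical helper: creates a server whose idle sockets outlive the
// load balancer's idle timeout, so the balancer closes connections first.
function createTunedServer(handler, lbIdleTimeoutMs = LB_IDLE_TIMEOUT_MS) {
  const server = http.createServer(handler);
  // Keep idle keep-alive sockets open longer than the balancer does.
  server.keepAliveTimeout = lbIdleTimeoutMs + 5000;
  // Commonly recommended: keep headersTimeout above keepAliveTimeout.
  server.headersTimeout = server.keepAliveTimeout + 1000;
  return server;
}

const server = createTunedServer((req, res) => {
  res.end('ok');
});
// server.listen(3000); // hypothetical port, uncomment to run
```

If these values are already in place and the 502s continue, the keep-alive race is probably not the whole story.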

A few instances are marked unhealthy every day, but the times at which that happens don't always match the times of the 502 errors. Should I increase the health check timeout from 5s to 10s and see what happens?

Memory and CPU usage seem fine.

Any tips on how I should debug this issue?

soltex

1 Answer

You already know the answer: unhealthy instances. Even if the times do not match, you should fix that problem first and then check whether other issues persist.

Increase the instance size, increase the ELB health check timeouts, scale up the machines, and check whether it helps.

exeral
  • Yes, you are right! I will start by increasing the health check timeouts. Actually, memory and CPU usage seem fine to me, which is why I am not sure if I should upgrade the machines. Anyway, I will give it a try if the health check timeouts don't work. – soltex Aug 11 '21 at 09:21
  • The size may not resolve your problem since your metrics are OK, but it is easy and cheap to bump the size for 1 hour, so it is still worth a try. – exeral Aug 11 '21 at 20:42
  • Increasing the health check timeouts resulted in slightly fewer unhealthy instances, but the number of 502 errors is the same. I will try bumping the instances; as you said, it is still worth a try. – soltex Aug 12 '21 at 12:39
  • Bumping the instances didn't work. Do you have any other ideas? I don't even know why I have unhealthy instances if the metrics are OK. – soltex Aug 13 '21 at 09:20
  • What is your health check? What are the corresponding logs on the EC2 instances for those health checks? – exeral Aug 13 '21 at 10:10
  • This is my health check: `Unhealthy threshold`: 2 consecutive health check failures (same as the healthy threshold), `Timeout`: 5s, `Interval`: 10s, `Algorithm`: Round robin. The logs look something like this: `GET /health-check 200 0ms`. Unfortunately, I don't have the logs from the instance that was considered unhealthy. I might enable that and see what the response time was right before the instance was terminated. – soltex Aug 13 '21 at 10:19
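One way to capture the missing response times is to log each health check with a timestamp and a duration, so the entries can be lined up against the moments the load balancer marked an instance unhealthy. The following is only a sketch: the `/health-check` path and the 5 s timeout are taken from the settings quoted in the last comment, while `formatHealthLog`, the `SLOW` flag, and the port are hypothetical names chosen for illustration.

```javascript
const http = require('node:http');

// Values taken from the health check settings quoted in the comments.
const HEALTH_PATH = '/health-check';
const LB_TIMEOUT_MS = 5000;

// Hypothetical helper: builds one log line and flags responses that took
// at least as long as the load balancer's health check timeout.
function formatHealthLog(isoTime, statusCode, durationMs) {
  const flag = durationMs >= LB_TIMEOUT_MS ? ' SLOW' : '';
  return `${isoTime} GET ${HEALTH_PATH} ${statusCode} ${durationMs.toFixed(1)}ms${flag}`;
}

const server = http.createServer((req, res) => {
  if (req.url !== HEALTH_PATH) {
    res.statusCode = 404;
    res.end();
    return;
  }
  const start = process.hrtime.bigint();
  // 'finish' fires once the response has been handed off to the OS.
  res.on('finish', () => {
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(formatHealthLog(new Date().toISOString(), res.statusCode, durationMs));
  });
  res.end('ok');
});
// server.listen(3000); // hypothetical port, uncomment to run
```

Note that this only measures time spent inside the handler. If the event loop is blocked before the handler runs, the delay will not show up in the duration, although gaps between consecutive timestamps (expected roughly every 10 s per health checker) can still hint at it.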