
We're running Node.js backend servers in AWS ECS, behind an ALB. In front of that we have AWS API Gateway with a proxy Lambda calling the ALB. This setup has been running in production for months, but a few days ago we suddenly started seeing 502 errors from some API calls.

I've checked the proxy Lambda logs and confirmed that the 502 is returned by the ALB. However, my Node application logs show no failing requests; in fact, no requests seem to have reached the application at those timestamps. I then enabled access logs on the ALB, which show only 200/201 responses, no 5xx whatsoever.

I'm now unsure where to look next. What could cause the ALB to return a 502 that never appears in its own access logs? And what could prevent the requests from reaching my Node app in ECS? Could some layer within ECS cause these symptoms? I can't see any errors in my Docker containers. Does anyone have an idea of which logs to check next, or how to pinpoint the errors?

It seems to happen in bursts: up to 50 failed requests within a short period, then everything is fine for several hours.


2 Answers


A 502 from an ALB can have a number of causes. The following may apply to you:

- The load balancer received a TCP RST from the target when attempting to establish a connection.
- The load balancer received an unexpected response from the target, such as "ICMP Destination unreachable (Host unreachable)", when attempting to establish a connection. Check whether traffic is allowed from the load balancer subnets to the targets on the target port.
- The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer (see the sketch below).
- The target response is malformed or contains HTTP headers that are not valid.
- The load balancer encountered an SSL handshake error or SSL handshake timeout (10 seconds) when connecting to a target.

reference docs
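
If the keep-alive mismatch is the culprit, here is a minimal Node.js sketch of the usual fix. The port and the 65/66 second values are assumptions based on the default ALB idle timeout of 60 seconds; adjust them if you've changed the ALB setting:

```javascript
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200);
  res.end('ok');
});

server.listen(3000); // placeholder port

// The ALB idle timeout defaults to 60 s, while Node's default
// keepAliveTimeout is only 5 s, so Node can close an idle keep-alive
// socket just as the ALB reuses it — which the ALB reports as a 502.
// Keep the target side open longer than the load balancer side.
server.keepAliveTimeout = 65 * 1000; // > ALB idle timeout (60 s default)
server.headersTimeout = 66 * 1000;   // must exceed keepAliveTimeout
```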


This turned out to be a memory leak in my container applications. RAM usage grew with every request until the process crashed. It then took a while for ECS and the ALB to react, so a bunch of requests were routed to the dead instance. The problem was resolved by fixing the leak, but I'd have liked better built-in support in ECS/CloudWatch for alarms on high memory usage, with triggers to gracefully replace instances under load. It seems I have to build that from scratch.
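
For the alarm part, here is a minimal sketch using the AWS SDK for JavaScript v3. ECS publishes a per-service MemoryUtilization metric in the AWS/ECS namespace; the region, cluster/service names, threshold, and SNS topic ARN below are placeholders, and the "replace the task" action itself is left out:

```javascript
const {
  CloudWatchClient,
  PutMetricAlarmCommand,
} = require('@aws-sdk/client-cloudwatch');

const client = new CloudWatchClient({ region: 'eu-west-1' }); // placeholder region

async function createMemoryAlarm() {
  await client.send(new PutMetricAlarmCommand({
    AlarmName: 'ecs-service-high-memory',           // placeholder name
    Namespace: 'AWS/ECS',
    MetricName: 'MemoryUtilization',
    Dimensions: [
      { Name: 'ClusterName', Value: 'my-cluster' }, // placeholder
      { Name: 'ServiceName', Value: 'my-service' }, // placeholder
    ],
    Statistic: 'Average',
    Period: 60,              // evaluate one-minute averages
    EvaluationPeriods: 3,    // alarm after 3 consecutive breaches
    Threshold: 80,           // percent of the task's memory reservation
    ComparisonOperator: 'GreaterThanThreshold',
    // Notify an SNS topic; wiring that notification to something that
    // recycles the task (e.g. a Lambda) is the part you build yourself.
    AlarmActions: ['arn:aws:sns:eu-west-1:123456789012:ops-alerts'], // placeholder ARN
  }));
}

createMemoryAlarm().catch(console.error);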
