
My production web service consists of:

  • Auto-scaling group
  • Network Load Balancer (ELB)
  • 2x EC2 instances as web servers

This configuration was running fine until yesterday when one of the EC2 instances started to experience RDS and ElastiCache timeouts. The other instance continues to run without issues.

During investigation, I noticed that outgoing connections in general sometimes experience large delays:

[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m7.147s -- 7 seconds
user    0m0.007s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m3.114s
user    0m0.007s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m0.051s
user    0m0.006s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    1m6.309s -- over a minute!
user    0m0.009s
sys     0m0.000s
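
To narrow down where that time goes, a loop like the one below (just a diagnostic sketch; the target host and the 20 iterations are arbitrary) splits each request into DNS lookup, TCP connect, and total time:

#!/bin/bash
# Diagnostic sketch: time 20 requests and break each one into name-lookup,
# TCP-connect, and total time, to see whether the stalls come from DNS or
# from raw TCP connection setup.
for i in $(seq 1 20); do
    curl -s -o /dev/null \
         -w "dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n" \
         --max-time 70 \
         https://www.google.com
    sleep 1
done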

[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
 1  * * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
 1  216.182.226.174  17.706 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.8.4), 1 hops max, 60 byte packets
 1  216.182.226.174  20.364 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.132), 1 hops max, 60 byte packets
 1  216.182.226.170  12.680 ms  12.671 ms *
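
To put a number on how often the first hop drops probes, a quick loop like this (rough sketch; 50 runs with one probe each is an arbitrary choice) counts the traceroutes that get no reply at hop 1:

#!/bin/bash
# Sketch: run 50 single-probe traceroutes and count how many get no
# reply at all on hop 1.
lost=0
for i in $(seq 1 50); do
    if traceroute -n -m 1 -q 1 www.google.com 2>/dev/null | tail -1 | grep -qF '*'; then
        lost=$((lost + 1))
    fi
done
echo "hop-1 probes lost: $lost / 50"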

Further analysis shows that if I manually detach the 'bad' instance from the auto-scaling group, removing it as a load balancer target, the problem instantly goes away. As soon as I add it back, the problem returns.
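For reference, this is roughly how I pull the instance out of and back into rotation (sketch only; the auto-scaling group name and instance ID below are placeholders):

# Take the 'bad' instance out of the auto-scaling group (and therefore out
# of the NLB target groups the group manages) without launching a replacement.
# 'my-web-asg' and the instance ID are placeholders.
aws autoscaling detach-instances \
    --auto-scaling-group-name my-web-asg \
    --instance-ids i-0123456789abcdef0 \
    --should-decrement-desired-capacity

# Put it back once connections have drained and it looks healthy again.
aws autoscaling attach-instances \
    --auto-scaling-group-name my-web-asg \
    --instance-ids i-0123456789abcdef0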

These nodes are m5.xlarge and appear to have excess capacity, so I don't believe it's a resource issue.

UPDATE: It seems related to load on the node. I put load back on last night and it seemed stable, but this morning as load grows, outbound traffic (DB, etc.) starts to fail. I'm really stuck; I don't understand how this outbound traffic is being impacted at all. The other, identical node has no issues, even with 100% of the traffic versus 50%.

traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
 1  216.182.226.174  18.691 ms 216.182.226.166  18.341 ms 216.182.226.174  18.660 ms
traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
 1  * * *

What is the 216.182.226.166 IP? Is it related to the VPC IGW?
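
To partly answer my own question, AWS publishes its address ranges, so a lookup like this (a naive prefix match rather than a proper CIDR containment check; needs jq installed) shows which AWS service owns the 216.182.x.x blocks:

# Check which published AWS ranges the 216.182.x.x hop addresses fall into.
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
  | jq -r '.prefixes[]
           | select(.ip_prefix | startswith("216.182."))
           | [.ip_prefix, .region, .service] | @tsv'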

Node stats:

  • m5.xlarge
  • CPU ~ 7.5%
  • load average: 0.18, 0.29, 0.29
  • Network In: ~8 MB/minute
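
Given those numbers look healthy, the only other host-level counters I know to check are the ENA interface statistics; newer ENA driver versions expose "allowance exceeded" counters when an instance hits its per-instance network limits, though they may not exist on this AMI's 4.14 kernel:

# Look for instance-level network drops in the ENA driver statistics.
# The *_allowance_exceeded counters only exist on newer ENA driver versions,
# so this may show nothing on an older kernel/driver.
ethtool -S eth0 | grep -Ei 'allowance|drop|err'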

UPDATE: With 1 of the 2 nodes attached to the load balancer, things appear to run stable, with all traffic on one node. After I add the 2nd node to the load balancer, after some period of time (hours to days), one of the nodes starts to exhibit the outbound connection issues described above (connections timing out to the database, Google, etc.). In this state, the other node works fine. Replacing the 'bad' node, or removing and re-adding it to the load balancer, allows things to run fine for a while. These images use Amazon Linux 2 (4.14.114-103.97.amzn2.x86_64).

DanielB6
  • Instances behind a load balancer should ideally be cattle rather than pets, auto scaled up and down as required. Terminate it and create another, using the working instance as a template if you need to. – Tim Jun 03 '19 at 21:29
  • Thanks for the feedback, that is definitely the intention. The cattle still need to serve their purpose though. There must be a way to debug, otherwise I'm afraid that every node will eventually suffer the same fate. – DanielB6 Jun 03 '19 at 21:44
  • VPC flow logs would be a good start. I wonder if RDS logs could have anything, but probably not. This is probably worth an AWS support case, as timeouts to RDS wouldn't be common. – Tim Jun 03 '19 at 22:30
  • Thanks, I'll see if I can setup VPC flow logs. I've posted a similar request in the AWS EC2 support forum. – DanielB6 Jun 03 '19 at 22:48
  • Which type of ELB? If Classic or Application, there's nothing in the way instances are bound to those types of load balancers that could have any foreseeable impact on outbound traffic. NLB is potentially a different matter. – Michael - sqlbot Jun 04 '19 at 00:13
  • It's a Network Load Balancer (TCP); auto-scaling is set up to automatically add and remove instances from the port 80 and 443 target groups. If I continuously traceroute or curl an external host (Google, RDS in EC2-Classic), I get sporadic success but a lot of extremely long queries. It was running fine for a couple of weeks before 1 of the 2 nodes started exhibiting this problem yesterday. No software or configuration changes were made. If I detach the bad instance, the problem goes away once traffic is drained. I set up flow logs, not really sure what to look for though. – DanielB6 Jun 04 '19 at 00:20
  • I replaced the sick cattle with a new instance to keep my number at 2. Now the one that was previously healthy is exhibiting the outgoing timeouts problem. The new one uses the same image, and is healthy. – DanielB6 Jun 06 '19 at 12:39

1 Answer


It is possible you are using a NAT gateway/instance to reach the internet. If not, you may have to give more information on the architecture; you could be using Direct Connect and routing internet traffic via an on-prem network.

Please read these, relating to system limits and inbound connections on ephemeral ports:

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-recommended-nacl-rules.html
https://aws.amazon.com/premiumsupport/knowledge-center/resolve-connection-nat-instance/
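
As a rough way to check whether the instance is anywhere near its ephemeral port / connection limits (sketch only; adjust the filters as needed):

# Show the ephemeral port range the kernel hands out for outbound connections.
cat /proc/sys/net/ipv4/ip_local_port_range    # e.g. 32768 60999

# Count current TCP connections by state; compare the ESTABLISHED/TIME-WAIT
# totals against the size of that range.
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn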

HumayunM
  • NLB (TCP 80, TCP 443) -> EC2 x2 (VPC). Internet access from the VPC is provided by an Internet Gateway. This architecture runs fine with a single EC2 instance. With 2 EC2 instances, one of them always fails eventually. The symptom is an inability to access external resources like the DB. Even sending a curl request or traceroute to Google suffers sporadic failures: some responses in milliseconds, others up to 30 seconds. I wondered about ephemeral ports, but usage seems to be far under the limit. I reviewed the system limits, but the config works with all traffic on 1 node and fails only when split between 2. – DanielB6 Jun 11 '19 at 17:51