My production web service consists of:
- Auto-scaling group
- Network load balancer (ELB)
- 2x EC2 instances as web servers
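Both instances are serving traffic behind the NLB; for reference, target registration and health can be checked with something like the following (the target group ARN is a placeholder):

aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/web/abc123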
This configuration was running fine until yesterday when one of the EC2 instances started to experience RDS and ElastiCache timeouts. The other instance continues to run without issues.
During investigation, I noticed that outgoing connections in general sometimes experience large delays:
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null
real 0m7.147s -- 7 seconds
user 0m0.007s
sys 0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null
real 0m3.114s
user 0m0.007s
sys 0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null
real 0m0.051s
user 0m0.006s
sys 0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null
real 1m6.309s -- over a minute!
user 0m0.009s
sys 0m0.000s
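To see which phase is stalling (DNS lookup, TCP connect, or the response itself), curl's timing variables can break the request down; a rough sketch, using www.google.com only as a convenient external endpoint:

for i in $(seq 1 10); do
  curl -s -o /dev/null \
    -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
    www.google.com
done

The traceroutes below show that even the first hop responds only intermittently: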
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
1 * * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
1 216.182.226.174 17.706 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.8.4), 1 hops max, 60 byte packets
1 216.182.226.174 20.364 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.132), 1 hops max, 60 byte packets
1 216.182.226.170 12.680 ms 12.671 ms *
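Three probes per hop is a small sample, so if mtr is available (it may need to be installed first), it can send many probes and report per-hop loss; a sketch:

sudo yum install -y mtr                 # assuming the package is available in the enabled repos
sudo mtr -n -r -c 100 www.google.com    # numeric output, report mode, 100 probes per hop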
Further analysis shows that if I manually detach the 'bad' instance from the auto-scaling group (removing it as a load balancer target), the problem instantly goes away. As soon as I add it back, the problem returns.
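The detach/re-attach was done by hand; for reference, the equivalent CLI looks roughly like this (instance ID, target group ARN, and ASG name are placeholders):

# take the suspect instance out of the NLB target group
aws elbv2 deregister-targets \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/web/abc123 \
    --targets Id=i-0123456789abcdef0

# or detach it from the auto-scaling group without terminating it
aws autoscaling detach-instances \
    --auto-scaling-group-name web-asg \
    --instance-ids i-0123456789abcdef0 \
    --should-decrement-desired-capacity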
These nodes are m5.xlarge and appear to have excess capacity, so I don't believe it's a resource issue.
UPDATE: It seems related to load on the node. I put load back on last night and it seemed stable, but this morning, as load grows, outbound connections (DB, etc.) start to fail. I'm really stuck; I don't understand how this outbound traffic is being impacted at all. The other, identical node has no issues, even when handling 100% of the traffic rather than 50%.
traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
1 216.182.226.174 18.691 ms 216.182.226.166 18.341 ms 216.182.226.174 18.660 ms
traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
1 * * *
What is the 216.182.226.166 IP? Is it related to the VPC's internet gateway (IGW)?
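One way to check who owns that address is AWS's published IP ranges; a sketch, assuming jq is installed:

curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
  | jq -r '.prefixes[] | select(.service=="EC2") | "\(.ip_prefix) \(.region)"' \
  | grep '^216\.182\.'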
Node stats:
- m5.xlarge
- CPU ~ 7.5%
- load average: 0.18, 0.29, 0.29
- Network In: ~8 MB/minute
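For comparison between the two nodes, the same network numbers can be pulled from CloudWatch; a sketch (instance ID and time window are placeholders):

aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name NetworkIn \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistics Sum \
    --period 60 \
    --start-time 2019-06-01T00:00:00Z \
    --end-time 2019-06-01T01:00:00Z
# --period 60 requires detailed monitoring; with basic monitoring use --period 300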
UPDATE: With 1 of the 2 nodes attached to the load balancer, things appear to run stable, with all traffic on one node. After I add the 2nd node to the load balancer, after some period of time (hours to days), one of the nodes starts to exhibit the outbound connection issues described above (connections to the database, Google, etc. timing out). In this state, the other node works fine. Replacing the 'bad' node, or removing and re-adding it to the load balancer, lets things run fine for a while. These images use Amazon Linux 2 (kernel 4.14.114-103.97.amzn2.x86_64).
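To catch the exact moment the problem starts and correlate it with load, I can leave a simple probe running that logs outbound connect times once a minute; a rough sketch (endpoint and log path are arbitrary):

while true; do
  ts=$(date -u +%FT%TZ)
  t=$(curl -s -o /dev/null -m 30 \
        -w 'connect=%{time_connect}s total=%{time_total}s' www.google.com)
  echo "$ts $t" >> ~/outbound-probe.log
  sleep 60
done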