High ELB latency using NAT and public/private subnets

Question

I started with the default VPC configuration for our application, but it has recently become more complex. We are using an ECS cluster with 1 EC2 instance and 1 ELB linked to the ECS service.

We recently had to add SQS with Lambda and found that we needed a NAT for the Lambda function to access the SQS queue. Since we added the NAT, everything has gone wrong.

In terms of network configuration, it's pretty much the default one:

- 1 VPC (172.31.0.0/16)
- 2 Public subnets: 
  - pubsub1 - CIDR: 172.31.48.0/20
  - pubsub2 - CIDR: 172.31.0.0/20
- 2 Private subnets:
  - privsub1 - CIDR: 172.31.16.0/20
  - privsub2 - CIDR: 172.31.32.0/20

- 1 Main route table (not explicitly assigned to any subnet):
  - 172.31.0.0/16 -> local
  - 0.0.0.0/0 -> igw
- 1 Public route table (pubsub1 and pubsub2):
  - 172.31.0.0/16 -> local
  - 0.0.0.0/0 -> igw
- 1 Private route table (privsub1 and privsub2):
  - 172.31.0.0/16 -> local
  - 0.0.0.0/0 -> NAT

- RDS using the default subnet group (pubsub1, pubsub2, privsub1 and privsub2)
- EC2 (part of ECS cluster) using privsub1 subnet
- ELB using pubsub1 and pubsub2 subnets
- Lambda using privsub1 and privsub2 subnets
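
As a quick sanity check on the layout above, the CIDR blocks can be verified offline. This is a minimal sketch using only Python's standard `ipaddress` module; the helper names `outside_vpc`, `overlapping_pairs` and `subnet_for` are made up for illustration, and the script only checks the address arithmetic, not the actual AWS routing.

```python
import ipaddress
from itertools import combinations

# CIDR blocks copied from the configuration listed in the question.
VPC = ipaddress.ip_network("172.31.0.0/16")
SUBNETS = {
    "pubsub1": ipaddress.ip_network("172.31.48.0/20"),
    "pubsub2": ipaddress.ip_network("172.31.0.0/20"),
    "privsub1": ipaddress.ip_network("172.31.16.0/20"),
    "privsub2": ipaddress.ip_network("172.31.32.0/20"),
}


def outside_vpc(vpc, subnets):
    """Return the names of subnets that are not contained in the VPC range."""
    return [name for name, net in subnets.items() if not net.subnet_of(vpc)]


def overlapping_pairs(subnets):
    """Return every pair of subnet names whose address ranges overlap."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(subnets.items(), 2)
        if net_a.overlaps(net_b)
    ]


def subnet_for(address, subnets):
    """Return the name of the subnet containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, net in subnets.items():
        if ip in net:
            return name
    return None


if __name__ == "__main__":
    print("outside VPC:", outside_vpc(VPC, SUBNETS))
    print("overlapping:", overlapping_pairs(SUBNETS))
```

Both lists come back empty for the ranges above, so the addressing itself is consistent. `subnet_for` can help map a private IP seen in logs back to a subnet.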

The ELB reports that the health check is failing and removes my EC2 instance from the pool. However, if I SSH into the EC2 box (through an intermediate EC2 server in a public subnet) and curl localhost:80/healthcheck.html (the path configured for the ELB health check), it responds correctly.

I checked the security groups too:

- 1 security group for the ELB allowing inbound HTTP and HTTPS from ALL sources and allowing ALL outbound traffic
- 1 security group for the EC2 server allowing inbound HTTP from the elb-security-group (I also tried from all sources)
- 1 security group for the RDS allowing TCP connections on the database port from ec2-security-group

If I add the ELB to the 2 private subnets, the health check works. However, when running curl requests I see high latency:

HTTPCode=200 TotalTime=1.401
HTTPCode=200 TotalTime=1.660
HTTPCode=200 TotalTime=1.537
HTTPCode=200 TotalTime=1.529
HTTPCode=200 TotalTime=1.519
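
To compare runs (for example through the ELB versus hitting the instance directly), the curl output can be summarized with a few lines of Python. This is only a sketch: it assumes the lines have exactly the `HTTPCode=... TotalTime=...` shape shown above, and `parse_timings` and `summarize` are illustrative names.

```python
import re
from statistics import mean

# The five samples reported above, pasted verbatim.
SAMPLE = """\
HTTPCode=200 TotalTime=1.401
HTTPCode=200 TotalTime=1.660
HTTPCode=200 TotalTime=1.537
HTTPCode=200 TotalTime=1.529
HTTPCode=200 TotalTime=1.519
"""

# One status code and one total time (in seconds) per line.
LINE = re.compile(r"HTTPCode=(\d{3}) TotalTime=(\d+(?:\.\d+)?)")


def parse_timings(text):
    """Return a list of (status_code, total_seconds) tuples found in text."""
    return [(int(code), float(total)) for code, total in LINE.findall(text)]


def summarize(samples):
    """Return count, min, max and mean of the total times in samples."""
    times = [total for _, total in samples]
    return {
        "count": len(times),
        "min": min(times),
        "max": max(times),
        "mean": mean(times),
    }


if __name__ == "__main__":
    print(summarize(parse_timings(SAMPLE)))
```

For the five samples above the mean is roughly 1.53 seconds.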

At this point I'm a bit lost and have no idea what to do. I'm pretty sure it's a network issue but I cannot isolate it.

Here is the Chrome timing for one of the requests:

and for a subsequent, identical request:

I also posted on AWS forum: https://forums.aws.amazon.com/thread.jspa?threadID=236569

UPDATE1

I've enabled the cross-zone load balancing on my ELB to fix my health check issue (ELB being in public subnets and EC2 in private).
The network ACLs are the default ones and allow everything.
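
For reference, cross-zone load balancing can also be toggled from a script instead of the console. The sketch below assumes a Classic Load Balancer, that `boto3` is installed, and that AWS credentials are configured; the load balancer name passed in is hypothetical. The attribute payload is built separately so it can be inspected without calling AWS.

```python
def cross_zone_attributes(enabled=True):
    """Build the attribute payload that switches cross-zone balancing on or off."""
    return {"CrossZoneLoadBalancing": {"Enabled": enabled}}


def set_cross_zone(load_balancer_name, enabled=True):
    """Apply the setting to a Classic Load Balancer (assumption: Classic ELB API)."""
    import boto3  # imported lazily so the payload helper works without boto3

    elb = boto3.client("elb")
    return elb.modify_load_balancer_attributes(
        LoadBalancerName=load_balancer_name,
        LoadBalancerAttributes=cross_zone_attributes(enabled),
    )


if __name__ == "__main__":
    # Prints the payload only; call set_cross_zone("my-elb") to apply it
    # ("my-elb" is a placeholder name, not taken from the question).
    print(cross_zone_attributes(True))
```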

The ELB latency is still the same (1 to 2 seconds).
When I move the EC2 instance to a public subnet and hit the box directly, the response time drops to 400ms.
The RDS instance, which is in both public and private subnets, has not been accessible from our office (the outside world) since we added the NAT.

UPDATE2

I fixed the issue with the RDS instance not being accessible from our office. I think it was caused by enabling NAT while the RDS was using all 4 subnets (the 2 public and the 2 private ones).
The RDS needed to use ONLY the public subnets. However, modifying the RDS subnet group is not enough: even though the RDS instance details show that the subnets have changed, the change does not take effect.

From the AWS FAQ:

Q: Can I change the DB Subnet Group of my DB Instance?

[...] At the present time, updating an existing DB Subnet Group does not change the current subnet of the deployed DB instance; an instance-type scale operation is required. Explicitly changing the DB Subnet Group of a deployed DB instance is not currently allowed.

So the only way is to change the size of the RDS instance, or to deploy a new instance from a DB snapshot, specifying the new subnet group (which uses ONLY public subnets). Make sure the security group is the correct one too, because the default one is selected by default.
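
The snapshot route can be scripted roughly as follows. This is a sketch, not the exact steps I ran: it assumes `boto3` is installed with credentials configured, every identifier in the usage example is a placeholder, and setting `PubliclyAccessible` is my assumption based on wanting to reach the database from the office. Passing the security group explicitly avoids ending up with the default one.

```python
def build_restore_params(new_instance_id, snapshot_id, subnet_group, security_group_ids):
    """Build the arguments for restoring a snapshot into a specific subnet group."""
    return {
        "DBInstanceIdentifier": new_instance_id,
        "DBSnapshotIdentifier": snapshot_id,
        # Subnet group containing ONLY the public subnets.
        "DBSubnetGroupName": subnet_group,
        # Set explicitly; otherwise the default security group is used.
        "VpcSecurityGroupIds": list(security_group_ids),
        # Assumption: the instance must be reachable from outside the VPC.
        "PubliclyAccessible": True,
    }


def restore_from_snapshot(params):
    """Call RDS with the prepared parameters (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    rds = boto3.client("rds")
    return rds.restore_db_instance_from_db_snapshot(**params)


if __name__ == "__main__":
    # All identifiers below are placeholders for illustration.
    params = build_restore_params(
        "mydb-public", "mydb-snapshot", "public-only-subnet-group", ["sg-00000000"]
    )
    print(params)
```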

I'm still investigating the ELB latency...

*Lambda using pubsub1 and pubsub2 subnets* can't be right. Check that? Next, are your VPC network ACLs all set to allow everything? That's the most obvious potential problem here, because ELB on a public subnet and EC2 behind it on a private subnet is standard configuration. Also, are you saying the ELB is alternating between working and not working? That suggests a misconfiguration of one of its subnets and not the other. If reviewing your config doesn't help, putting ELB on only one subnet at a time may help pinpoint what's going on. — Michael - sqlbot, Aug 03 '16 at 13:05
Sorry, Lambda is using private subnets ONLY. I'll fix that. Maybe it's working/not working with the ELB using public subnets ONLY, but I can't test it because the health check is failing and the instance gets deregistered :/ — maxwell2022, Aug 03 '16 at 18:05
Also, I cannot access the RDS instance from the office box (EC2 instances can access it fine), even though it's in both public and private subnets and the security group allows connections from the office IP. I used to be able to before we added the NAT — maxwell2022, Aug 03 '16 at 18:11
Okay... Somehow the ELB cross-zone was disabled. Now the healthcheck is working. However I still have 1.5-2 seconds latency. I moved the EC2 in a public subnet to try to hit the box directly and it's down to 500-400ms. So it's something between the ELB and the EC2 server — maxwell2022, Aug 04 '16 at 02:11
