AWS Elastic Load Balancing: Seeing extremely long initial connection time

Question

For the past couple of days, we have often seen an extremely long initial connection time (15 s to 1.3 minutes) to our ELBs when making any request over SSL. Oddly, I was only able to observe this in Google Chrome (not in Safari, Firefox, or curl).

It does not occur on every request, but on around 50% of them. It occurs on the first request (the OPTIONS call).

Our setup is the following: a cross-zone ELB that connects to a node.js backend (currently in 2 AZs in eu-west-1). All instances are healthy and once the request comes through, it is processed normally. Currently, there is basically no load on the system. CloudWatch for the ELB does not report any backend connection errors, and neither the SurgeQueue (value 0) nor the spillover count shows anything. The ELB metrics show a low latency (< 100 ms). We have Route53 configured to route to the ELB (we don't see any DNS trouble, see attached screenshot).

We have different REST APIs that all have this setup. It occurs on all of the ELBs (each of them connects to an independent node.js backend). All of these ELBs are set up the same way via our CloudFormation template.

The ELBs also do our SSL-termination.

What could lead to such a behavior? Is it possible that the ELBs are not configured properly? And why could it only appear on Google Chrome?

You should install wireshark on the machine with the browser and try to identify at what point in the tcp handshake the latency is appearing. This seems very unusual. — Michael - sqlbot, Feb 20 '16 at 18:33
@gboda good find, pity it has no answers, either. Maybe we have another one here somewhere that does. — Michael - sqlbot, Feb 20 '16 at 18:35
Weird, here's [probably another one](http://stackoverflow.com/questions/34905110/site-connecting-very-slowly-in-chrome-dns-issue) also unanswered. Strange Chrome + ELB interaction? — Michael - sqlbot, Feb 20 '16 at 18:38
I just created a question for the same issue, though for ALB rather than ELB, [here](https://stackoverflow.com/questions/48287348/aws-application-load-balancing-seeing-extremely-long-initial-connection-time/48287350#48287350). We found a solution, but interestingly enough, all the symptoms were exactly the same as in this question. — Bruno Batarelo, Jan 16 '18 at 18:09

score 36 · Accepted Answer · answered Feb 25 '16 at 17:45

I think this is likely an ELB misconfiguration. I had the same problem when I attached private subnets to the ELB. I fixed it by changing the private subnets to public ones. See https://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-manage-subnets.html

Nikita Ogurtsov

For public facing ELBs, select only public subnets. For private facing ELBs, select only private subnets. – Miguel Mota Mar 15 '18 at 19:39

score 11 · Answer 2 · edited May 08 '16 at 07:12

Just to follow up on @Nikita Ogurtsov's excellent answer: I had the same problem, except that just one of my subnets happened to be private and the rest were public.

Even if you think your subnets are public, I recommend you double-check the route tables to ensure that they all have a route to an Internet Gateway.

You can use a single Route Table that has a Gateway for all your LB subnets if that makes sense for your setup.

VPC/Subnets/(select subnet)/Route Table/Edit
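
The route table check described above can also be scripted. The following is only a sketch: it assumes the route tables are supplied as dictionaries shaped like the EC2 DescribeRouteTables response (`Routes`, `Associations`, `GatewayId`, and so on), that they all belong to one VPC, and that only the IPv4 default route matters. The function names and sample IDs are hypothetical.

```python
def has_internet_gateway_route(route_table):
    """Return True if the table has an active IPv4 default route via an Internet Gateway."""
    for route in route_table.get("Routes", []):
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        # Internet Gateway IDs start with "igw-"; NAT gateways use a different key.
        via_igw = (route.get("GatewayId") or "").startswith("igw-")
        is_active = route.get("State", "active") == "active"
        if is_default and via_igw and is_active:
            return True
    return False


def find_private_subnets(route_tables, subnet_ids):
    """Return the subnet IDs whose effective route table lacks an Internet Gateway route.

    Subnets without an explicit association fall back to the VPC's main route table.
    """
    explicit = {}
    main_table = None
    for table in route_tables:
        for association in table.get("Associations", []):
            if association.get("Main"):
                main_table = table
            if "SubnetId" in association:
                explicit[association["SubnetId"]] = table

    private = []
    for subnet_id in subnet_ids:
        table = explicit.get(subnet_id, main_table)
        if table is None or not has_internet_gateway_route(table):
            private.append(subnet_id)
    return private
```

With boto3, the list returned under `RouteTables` by `describe_route_tables` on an EC2 client should be usable as `route_tables`; pass the subnet IDs attached to the load balancer as `subnet_ids`. Any subnet the function returns is worth inspecting by hand.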

edited May 08 '16 at 07:12

Atiq

answered May 08 '16 at 06:09

Alan Barker

In my case the ACL of one of the subnets was configured to deny all traffic. – M3L Mar 07 '19 at 15:30

score 3 · Answer 3 · edited May 13 '20 at 14:48

For me the issue was that I had an unused "Availability Zone" in my Classic Load Balancer. Once I removed the unhealthy and unused Availability Zone the consistent 20 or 21 second delay in "Initial Connection" dropped to under 50ms.

Note: You may need to give it time to update. I had my DNS TTL set to 60 seconds so I would see the fix within a minute of removing the unused Availability Zone.
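
To spot such a zone before removing it, a small helper can compare the zones enabled on the load balancer with the zones that actually contain a healthy instance. This is a sketch under assumptions: the input shape (`zone` and `state` keys) and the function name are made up for illustration, and `"InService"` is the state string a Classic Load Balancer reports for healthy instances.

```python
def zones_without_healthy_instances(enabled_zones, instances):
    """Return enabled zones that contain no instance in the 'InService' state.

    enabled_zones: list of Availability Zone names enabled on the load balancer.
    instances: list of dicts with 'zone' and 'state' keys (hypothetical shape).
    """
    healthy_zones = {
        instance["zone"] for instance in instances if instance["state"] == "InService"
    }
    # Preserve the order of enabled_zones so the output is easy to compare.
    return [zone for zone in enabled_zones if zone not in healthy_zones]
```

For example, passing `["eu-west-1a", "eu-west-1b"]` with a single healthy instance in `eu-west-1a` flags `eu-west-1b` as a candidate for removal.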

edited May 13 '20 at 14:48

answered Apr 20 '19 at 20:28

Elijah Lofgren

score 1 · Answer 4 · answered Apr 04 '16 at 08:53

This may be a problem with Amazon's ELB itself. The ELB scales its number of instances with the number of requests, so you should see a peak in requests at those times. Amazon adds instances to handle the load. The new instances are already exposed to clients during the launch process, so your clients get those timeouts. It is completely random, so you should:

- Ping the ELB repeatedly to collect all the IPs it uses
- Run mtr against every IP you found
- Keep an eye on CloudWatch
- Look for clues
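
The first step, collecting the IPs behind the ELB name, can be sketched in a few lines. This is an illustration only: `collect_ips` and its `rounds` parameter are hypothetical, and a local DNS cache may return the same answer on every round, so the result is a lower bound on the addresses in use.

```python
import socket


def collect_ips(hostname, rounds=5, resolve=socket.gethostbyname_ex):
    """Resolve hostname several times and return every distinct IPv4 address seen.

    The resolve parameter exists so the lookup can be replaced in tests.
    """
    seen = set()
    for _ in range(rounds):
        # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list).
        _name, _aliases, addresses = resolve(hostname)
        seen.update(addresses)
    return sorted(seen)
```

Each address in the returned list can then be passed to mtr in turn.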

Djory Krache · Answer 5 · 2017-10-08T17:36:47.743

Solution

If your DNS is configured to point directly at the ELB, you should reduce the TTL of the (IP, DNS) association. The ELB's IPs can change at any time, so stale records can seriously disrupt your traffic.

Clients keep some of the ELB's IPs in cache, so you can run into this kind of trouble.

Scaling Elastic Load Balancers

Once you create an elastic load balancer, you must configure it to accept incoming traffic and route requests to your EC2 instances. These configuration parameters are stored by the controller, and the controller ensures that all of the load balancers are operating with the correct configuration. The controller will also monitor the load balancers and manage the capacity that is used to handle the client requests. It increases capacity by utilizing either larger resources (resources with higher performance characteristics) or more individual resources. The Elastic Load Balancing service will update the Domain Name System (DNS) record of the load balancer when it scales so that the new resources have their respective IP addresses registered in DNS. The DNS record that is created includes a Time-to-Live (TTL) setting of 60 seconds, with the expectation that clients will re-lookup the DNS at least every 60 seconds. By default, Elastic Load Balancing will return multiple IP addresses when clients perform a DNS resolution, with the records being randomly ordered on each DNS resolution request. As the traffic profile changes, the controller service will scale the load balancers to handle more requests, scaling equally in all Availability Zones.

Best Practices ELB on AWS

You can't set up the TTL in Route53 if the entry is an ELB alias. — M3L, Mar 07 '19 at 15:29
Yes, but I didn't talk about Route53. Of course, Amazon preconfigured the DNS for its own ELB otherwise you'll have the error previously presented. — Djory Krache, Mar 14 '19 at 11:55
I've recently solved this problem by setting TTL of the alias record to 60 seconds. — Alexander Pravdin, Mar 19 '19 at 02:49

score 1 · Answer 6 · answered Feb 13 '21 at 15:47

An ALB needs at least 2 Availability Zones. If you use a private/public/NAT VPC setup, all public subnets must have a connection to the Internet.

Moo

score 0 · Answer 7 · edited Oct 22 '21 at 18:27

Check the security group too. That was the issue in my case.

edited Oct 22 '21 at 18:27

answered Apr 10 '19 at 21:06

Dmitry Grinko

Can you elaborate? – Mathias Lykkegaard Lorenzen Oct 22 '21 at 15:13
@MathiasLykkegaardLorenzen A security group acts as a virtual firewall for your instance to control inbound and outbound traffic. So make sure it is configured properly. Unfortunately I don't remember details. – Dmitry Grinko Oct 22 '21 at 18:27

score 0 · Answer 8 · answered Oct 18 '19 at 05:30

For me the issue was that the ALB was pointing to an Nginx instance, which had a misconfigured DNS resolver. This meant that Nginx tried to use the resolver, timed out, and then actually started working a bit later.

This is not really connected to the load balancer itself, but it may help someone figure out the issue in their own setup.

score 0 · Answer 9 · answered May 30 '22 at 20:31

I see a similar problem in my Chrome logs (1.3m lag). It happens in an OPTIONS request, and from wireshark, I don't even see the request leaving the PC in the first place. Any suggestions as to what Chrome might be doing are welcome.

Jerod Venema

score -1 · Answer 10 · answered Aug 24 '22 at 12:22

We recently encountered Chrome taking 1.3 minutes to load pages, but the cause was slightly different. Just posting it here in case it helps someone.

1.3 minutes seems to be how long Chrome will wait when trying to connect to a specific IP. Our domain name has multiple IP addresses in the A record (similar to a CNAME setup), and one of those IPs belonged to a server that had crashed. So sometimes the browser would connect quickly because it used a valid IP, and sometimes we would get the long wait as it tried to connect to the invalid IP, timed out, and then retried with a valid IP.

So it is worth checking that all the IPs listed when you dig your domain point to servers that are actually responding.
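
One way to run that check is to attempt a TCP connection to each address with a short timeout and record which ones fail. The sketch below assumes you already have the address list (for example from dig); `probe_addresses` is a hypothetical helper, and port 443 and the 5 second timeout are arbitrary defaults.

```python
import socket
import time


def probe_addresses(addresses, port=443, timeout=5.0, connect=socket.create_connection):
    """Try a TCP connection to each address and report success and elapsed time.

    Returns a dict mapping each address to (reachable, seconds).
    The connect parameter exists so the network call can be replaced in tests.
    """
    results = {}
    for address in addresses:
        start = time.monotonic()
        try:
            connection = connect((address, port), timeout)
            connection.close()
            reachable = True
        except OSError:
            # Covers refused connections, unreachable hosts, and timeouts.
            reachable = False
        results[address] = (reachable, time.monotonic() - start)
    return results
```

An address that is reported as unreachable, or that only answers after several seconds, is a likely cause of the intermittent long wait.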
