25

For the past couple of days, we have often seen an extremely long initial connection time (15 s to 1.3 minutes) to our ELBs when making any request via SSL. Oddly, I was only able to observe this in Google Chrome (not in Safari, Firefox, or curl).

It does not occur on every single request, but on around 50% of them. It occurs on the first request (the OPTIONS call).

Our setup is the following: a cross-zone ELB that connects to a Node.js backend (currently in 2 AZs in eu-west-1). All instances are healthy, and once the request comes through, it is processed normally. Currently, there is basically no load on the system. CloudWatch does not report any backend connection errors for the ELB, and both the SurgeQueue and the spillover count are 0. The ELB metrics show a low latency (< 100 ms). We have Route53 configured to route to the ELB (we don't see any DNS trouble, see attached screenshot).

We have different REST APIs that all have this setup. It occurs on all of the ELBs (each of them connects to an independent Node.js backend). All of these ELBs are set up the same way via our CloudFormation template.

The ELBs also do our SSL-termination.

What could lead to such behavior? Is it possible that the ELBs are not configured properly? And why would it appear only in Google Chrome?

[Screenshot: request timing]

ahrzg
  • You should install Wireshark on the machine with the browser and try to identify at what point in the TCP handshake the latency is appearing. This seems very unusual. – Michael - sqlbot Feb 20 '16 at 18:33
  • @gboda good find, pity it has no answers, either. Maybe we have another one here somewhere that does. – Michael - sqlbot Feb 20 '16 at 18:35
  • Weird, here's [probably another one](http://stackoverflow.com/questions/34905110/site-connecting-very-slowly-in-chrome-dns-issue) also unanswered. Strange Chrome + ELB interaction? – Michael - sqlbot Feb 20 '16 at 18:38
  • I just posted the same issue, not for an ELB but for an ALB, [here](https://stackoverflow.com/questions/48287348/aws-application-load-balancing-seeing-extremely-long-initial-connection-time/48287350#48287350). We found a solution, but interestingly enough, all the symptoms were exactly the same as in this question. – Bruno Batarelo Jan 16 '18 at 18:09

10 Answers

36

I think this is possibly an ELB misconfiguration. I had the same problem when I attached private subnets to the ELB. I fixed it by changing the private subnets to public ones. See https://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-manage-subnets.html

Nikita Ogurtsov
11

Just to follow up on @Nikita Ogurtsov's excellent answer: I had the same problem, except that just one of my subnets happened to be private and the rest were public.

Even if you think your subnets are public, I recommend you double-check the route tables to ensure that they all have a route to an Internet Gateway.

You can use a single Route Table that has a Gateway for all your LB subnets if this makes sense for your setup.

VPC/Subnets/(select subnet)/Route Table/Edit
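The same route table check can be scripted. What follows is only a minimal sketch, assuming the route tables are passed in as dictionaries shaped like the `RouteTables` entries returned by boto3's `ec2.describe_route_tables()` and that all subnets belong to one VPC. The function names `has_internet_gateway_route` and `private_subnets` are hypothetical helpers made up for this illustration.

```python
from typing import Any, Dict, List


def has_internet_gateway_route(route_table: Dict[str, Any]) -> bool:
    """Return True if the table has a default IPv4 route via an Internet Gateway."""
    for route in route_table.get("Routes", []):
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        # Internet Gateway IDs start with "igw-"; NAT routes use NatGatewayId instead.
        via_igw = route.get("GatewayId", "").startswith("igw-")
        if is_default and via_igw:
            return True
    return False


def private_subnets(subnet_ids: List[str], route_tables: List[Dict[str, Any]]) -> List[str]:
    """Return the subnet IDs whose effective route table lacks an Internet Gateway route.

    Assumption: all subnets are in the same VPC, so there is one main route table.
    """
    explicit: Dict[str, Dict[str, Any]] = {}
    main_table = None
    for table in route_tables:
        for assoc in table.get("Associations", []):
            if assoc.get("Main"):
                main_table = table
            elif "SubnetId" in assoc:
                explicit[assoc["SubnetId"]] = table

    result = []
    for subnet_id in subnet_ids:
        # Subnets without an explicit association fall back to the main route table.
        table = explicit.get(subnet_id, main_table)
        if table is None or not has_internet_gateway_route(table):
            result.append(subnet_id)
    return result
```

With boto3 installed, the input could come from `boto3.client("ec2").describe_route_tables()["RouteTables"]`, with the load balancer's subnet IDs taken from its description. Any subnet ID the function returns is worth a closer look in the console.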

Atiq
Alan Barker
3

For me the issue was that I had an unused "Availability Zone" in my Classic Load Balancer. Once I removed the unhealthy and unused Availability Zone the consistent 20 or 21 second delay in "Initial Connection" dropped to under 50ms.

Note: You may need to give it time to update. I had my DNS TTL set to 60 seconds so I would see the fix within a minute of removing the unused Availability Zone.
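To spot such a zone without clicking through the console, something like the sketch below might help. It is only an illustration: `zones_without_healthy_instances` is a hypothetical helper, and it assumes you have already collected the zones enabled on the load balancer, a mapping from instance ID to Availability Zone, and the instance health states (Classic Load Balancers report `InService`, `OutOfService`, or `Unknown`).

```python
from typing import Dict, Iterable, List


def zones_without_healthy_instances(
    enabled_zones: Iterable[str],
    instance_zone: Dict[str, str],
    instance_state: Dict[str, str],
) -> List[str]:
    """Return enabled zones that have no instance in the "InService" state.

    enabled_zones:  zones enabled on the load balancer
    instance_zone:  instance ID -> Availability Zone
    instance_state: instance ID -> health state as reported by the load balancer
    """
    healthy_zones = {
        instance_zone[instance_id]
        for instance_id, state in instance_state.items()
        if state == "InService" and instance_id in instance_zone
    }
    return sorted(zone for zone in enabled_zones if zone not in healthy_zones)
```

A zone that shows up in the result is still handed out in DNS while having nothing healthy behind it, which matches the symptom described above.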

Elijah Lofgren
1

This can be a problem with Amazon's ELB itself. The ELB scales its number of instances with the number of requests, so you should see a peak of requests at those times. Amazon adds instances to handle the load. The new instances are already reachable while they are still launching, so your clients get those timeouts. It's totally random, so you should:

  • ping the ELB to collect all the IPs in use

  • run mtr against every IP found

  • Keep an eye on CloudWatch

  • Find some clues

Djory Krache
1

Solution: If your DNS is configured to point directly at the ELB, you should reduce the TTL of the (IP, DNS) association. The ELB's IPs can change at any time, so a stale record can seriously harm your traffic.

The client keeps some of the ELB's IPs in cache, so you can run into this kind of trouble.

Scaling Elastic Load Balancers

Once you create an elastic load balancer, you must configure it to accept incoming traffic and route requests to your EC2 instances. These configuration parameters are stored by the controller, and the controller ensures that all of the load balancers are operating with the correct configuration. The controller will also monitor the load balancers and manage the capacity that is used to handle the client requests. It increases capacity by utilizing either larger resources (resources with higher performance characteristics) or more individual resources. The Elastic Load Balancing service will update the Domain Name System (DNS) record of the load balancer when it scales so that the new resources have their respective IP addresses registered in DNS. The DNS record that is created includes a Time-to-Live (TTL) setting of 60 seconds, with the expectation that clients will re-lookup the DNS at least every 60 seconds. By default, Elastic Load Balancing will return multiple IP addresses when clients perform a DNS resolution, with the records being randomly ordered on each DNS resolution request. As the traffic profile changes, the controller service will scale the load balancers to handle more requests, scaling equally in all Availability Zones.

Best Practices ELB on AWS

Djory Krache
1

An ALB needs at least 2 Availability Zones. If you use a private/public/NAT VPC setup, then all public subnets must have a connection to the Internet.

Moo
0

Check your security groups too. That was the issue in my case.

Dmitry Grinko
0

For me the issue was that the ALB was pointing to an Nginx instance, which had a misconfigured DNS resolver. This meant that Nginx tried to use the resolver, timed out, and then actually started working a bit later.

This isn't really connected to the load balancer itself, but maybe it helps someone figure out the issue in their own setup.

Janis Peisenieks
0

I see a similar problem in my Chrome logs (1.3m lag). It happens in an OPTIONS request, and from Wireshark, I don't even see the request leaving the PC in the first place. Any suggestions as to what Chrome might be doing are welcome.

Jerod Venema
-1

We recently encountered Chrome taking 1.3 mins to load pages, but the cause was slightly different. Just popping it here in case it helps someone.

1.3 mins seems to be how long Chrome will wait when trying to connect to a specific IP. Our domain name has multiple IP addresses in the A record (similar to a CNAME setup), and one of those IPs belonged to a server that had crashed. So sometimes the browser would connect quickly because it used a valid IP, and sometimes we would get the long wait as it tried to connect to the invalid IP, timed out, and then retried with a valid IP.

So it is worth checking that all the IPs listed when you dig your domain are actually reachable.
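One way to run that check is to resolve every A record and time a plain TCP connect to each address. The following is a rough sketch using only the Python standard library. The helper names (`resolve_ipv4`, `time_connect`, `probe_all`), the port 443, and the 5 second timeout are assumptions chosen for illustration, not anything prescribed by AWS or Chrome.

```python
import socket
import time
from typing import Dict, List, Optional


def resolve_ipv4(hostname: str, port: int = 443) -> List[str]:
    """Return the distinct IPv4 addresses the resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def time_connect(ip: str, port: int = 443, timeout: float = 5.0) -> Optional[float]:
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return None


def probe_all(hostname: str, port: int = 443, timeout: float = 5.0) -> Dict[str, Optional[float]]:
    """Map every resolved IPv4 address to its connect time (None if unreachable)."""
    return {ip: time_connect(ip, port, timeout) for ip in resolve_ipv4(hostname, port)}
```

Calling `probe_all` with your own domain and printing the result a few times should show whether one of the addresses consistently comes back as `None` or with a connect time far above the others.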

Luke