
I'm running a stress test with Locust on:

  • c4.xlarge (the attacking Locust load generator is a c4.4xlarge)
  • 1 instance
  • amazonlinux 2017.03

The load balancer is:

  • classic type
  • internet-facing
  • stickiness is disabled for both 80 & 443
  • 80 is forwarded to 80
  • 443 is forwarded to 80
  • idle timeout is 60s
  • cross zone load balancing is enabled
  • access logs disabled
  • connection draining is enabled with timeout 300 seconds
  • health check is configured as follows (also shown as a boto3 call after this list):
    • Ping Target: HTTP:80/status.html
    • Timeout: 5 seconds
    • Interval: 30 seconds
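
For completeness, the health check above corresponds to roughly this boto3 call (just documenting the settings; the load balancer name and the two threshold values are placeholders, since I did not list thresholds above):

import boto3

elb = boto3.client("elb")  # classic ELB API

elb.configure_health_check(
    LoadBalancerName="my-classic-elb",  # placeholder name
    HealthCheck={
        "Target": "HTTP:80/status.html",
        "Timeout": 5,              # seconds
        "Interval": 30,            # seconds
        "HealthyThreshold": 10,    # assumption, not listed above
        "UnhealthyThreshold": 2,   # assumption, not listed above
    },
)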

I run simple HEAD and GET requests against a /status.html endpoint with the same distribution: 25000 users, spawned at 1000 per second. For the HEAD requests I get a lot of these errors:

  • 504, GATEWAY_TIMEOUT
  • 408, REQUEST_TIMEOUT
  • 503, Service unavailable: Back-end server is at capacity

The error rate is about 10%. Strangely, for the GET requests I get hardly any errors, not even 1%.

Why would that happen?

If you need more details about the setup I can provide them. Unfortunately I am very new to AWS, so I don't know what to provide. Sorry!
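
For reference, the locustfile is roughly equivalent to this simplified sketch (current Locust API; in the actual runs HEAD and GET were tested separately, each with 25000 users spawned at 1000 per second):

from locust import HttpUser, task, between

class StatusUser(HttpUser):
    # pacing is an assumption; the real test may use different wait times
    wait_time = between(1, 2)

    @task
    def head_status(self):
        self.client.head("/status.html")

    @task
    def get_status(self):
        self.client.get("/status.html")

# started with something like:
#   locust -f locustfile.py --host https://my.endpoint.com -u 25000 -r 1000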

Here are some access logs from the production ELB from before the problem occurs; sorry, I couldn't get logs from the stress test so far.

Here is a 504:

2019-04-26T02:41:20.330496Z XXX xxx.xxx.xxx.xxx:63054 xxx.xxx.xxx.xxx:80 0.000101 31.01214 0.00002 200 200 166 160 "POST https://my.endpoint.com/some_script HTTP/1.1" "Java/1.6.0_26" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:40.594005Z XXX xxx.xxx.xxx.xxx:50071 xxx.xxx.xxx.xxx:80 0.00006 10.751718 0.000021 200 200 0 159 "GET https://my.endpoint.com/some_script HTTP/1.1" "GuzzleHttp/6.3.3 curl/7.40.0 PHP/5.5.25" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:20.446229Z XXX xxx.xxx.xxx.xxx:63063 xxx.xxx.xxx.xxx:80 0.000065 30.900277 0.00002 200 200 0 161 "GET https://my.endpoint.com/some_script HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:20.517259Z XXX xxx.xxx.xxx.xxx:56506 xxx.xxx.xxx.xxx:80 0.000053 30.829553 0.000018 200 200 0 161 "GET https://my.endpoint.com/some_script HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:42.652118Z XXX xxx.xxx.xxx.xxx:50120 xxx.xxx.xxx.xxx:80 0.000069 8.69724 0.000024 401 401 60 48 "POST https://my.endpoint.com/some_script HTTP/1.1" "go-resty/1.10.2 (https://github.com/go-resty/resty)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:51.360268Z XXX xxx.xxx.xxx.xxx:45201 - -1 -1 -1 504 0 146 0 "POST https://my.endpoint.com/some_script HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:51.361199Z XXX xxx.xxx.xxx.xxx:50120 - -1 -1 -1 504 0 146 0 "POST https://my.endpoint.com/some_script HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2

And later a 503:

2019-04-26T02:41:44.490135Z XXX xxx.xxx.xxx.xxx:50044 xxx.xxx.xxx.xxx:80 0.000062 28.220316 0.000019 200 200 0 320 "GET https://my.endpoint.com/some_script HTTP/1.1" "restify/4.3.1 (x64-linux; v8/5.1.281.111; OpenSSL/1.0.2n) node/6.12.3" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:32.082311Z XXX xxx.xxx.xxx.xxx:32882 xxx.xxx.xxx.xxx:80 0.000031 40.62881 0.000022 200 200 117 160 "POST https://my.endpoint.com/some_script HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:43.859743Z XXX xxx.xxx.xxx.xxx:32781 xxx.xxx.xxx.xxx:80 0.000077 28.851417 0.000015 200 200 184 78 "POST https://my.endpoint.com/some_script HTTP/1.1" "GuzzleHttp/6.2.1 curl/7.53.1 PHP/5.6.38" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:51.532781Z XXX xxx.xxx.xxx.xxx:51094 xxx.xxx.xxx.xxx:80 0.000027 21.178497 0.000014 200 200 0 159 "GET https://my.endpoint.com/some_script HTTP/1.1" "GuzzleHttp/6.3.3 curl/7.40.0 PHP/5.5.25" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:51.568865Z XXX xxx.xxx.xxx.xxx:45267 xxx.xxx.xxx.xxx:80 0.000026 21.142531 0.00002 200 200 0 159 "GET https://my.endpoint.com/some_script HTTP/1.1" "GuzzleHttp/6.3.3 curl/7.40.0 PHP/5.5.25" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:46.195626Z XXX xxx.xxx.xxx.xxx:55182 xxx.xxx.xxx.xxx:80 0.000084 26.516262 0.000017 200 200 269 160 "POST https://my.endpoint.com/some_script HTTP/1.1" "Apache-HttpClient/4.0.3 (java 1.5)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:41:42.982043Z XXX xxx.xxx.xxx.xxx:56428 xxx.xxx.xxx.xxx:80 0.000114 29.747779 0.000019 200 200 107 305 "POST https://my.endpoint.com/some_script HTTP/1.1" "Apache-HttpClient/4.0.3 (java 1.5)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:42:13.543180Z XXX xxx.xxx.xxx.xxx:47351 - -1 -1 -1 503 0 0 0 "POST https://my.endpoint.com/some_script HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
2019-04-26T02:42:13.587978Z XXX xxx.xxx.xxx.xxx:47351 - -1 -1 -1 503 0 0 0 "POST https://my.endpoint.com/some_script HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
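
In case it helps, this is a rough way to tally the ELB status codes from lines like the above (a quick sketch; field positions follow the classic ELB access log format):

from collections import Counter

def elb_status_counts(log_lines):
    # Classic ELB access log fields (space separated, before the quoted request):
    # timestamp elb client:port backend:port request_processing_time
    # backend_processing_time response_processing_time elb_status_code
    # backend_status_code received_bytes sent_bytes ...
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) > 8:
            counts[fields[7]] += 1  # elb_status_code
    return counts

# usage:
# with open("elb_access.log") as f:
#     print(elb_status_counts(f))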
steros
  • You've said you're getting an autoscaling group failure, but ASG only increases server counts based on load. Do you mean you're getting a load balancer error? What kind of load balancer is it? – Tim Apr 25 '19 at 05:30
  • I'm not sure what is failing. I only know it is an ASG so I suppose it might be that? I will add the load balancer info. – steros Apr 25 '19 at 07:47
  • I have also edited the title. I hope it is more appropriate! – steros Apr 25 '19 at 07:56
  • Your title suggests the instance is failing, but are you sure it's the instance? Could it be the load balancer? I wonder if it discards HEAD requests if it is too busy. You also can't just throw a huge load at a load balancer; you have to ramp it up or ask AWS to provision it for large load in advance. I suggest you run a 24 hour load test, ramping up slowly from zero to your target load over that time. Monitor errors as they change with time. – Tim Apr 25 '19 at 08:10
  • Do you maybe have a suggestion for a more appropriate title? I'll try to read up on the load balancer provisioning. Actually, the real failure case that prompted this test is a sudden spike in request numbers that leads to an ASG of 10 instances constantly terminating instances and launching new ones in a loop. – steros Apr 25 '19 at 09:28
    *"access logs disabled"* would seem like the place to start. Turn them on and review them, as well as any logs on your web servers. Also, is there a reason for choosing a Classic balancer? Unless you have a specific reason, you should probably be using an Application Load Balancer (ALB). – Michael - sqlbot Apr 25 '19 at 13:24
  • Thank you, I added logs from production as I'm not currently able to activate the logs on the stress test setup, but maybe they help anyway. I have checked the application logs on the instance, but so far that did not turn out to be very helpful. So I thought I'd check whether the same error occurs on a simple HTTP HEAD or GET request. As you can see I get the same errors, but the cause might be different? I asked because I found it strange that the HEAD requests suddenly produced many more errors. Thank you for your input! – steros Apr 26 '19 at 05:57
  • Have a look at my answer. Ramp up load from zero to your fairly high load over a few hours, rather than throwing huge load at it. Monitor the error rate, perhaps using CloudWatch Logs, to see if it increases with load. I suspect your problem is your testing methodology rather than your infrastructure setup. – Tim Apr 26 '19 at 07:27

1 Answer


Here's my best guess. It's based on what little you've told us; really, you need to be looking at your auto-scaling group logs and instance logs.

Theory

I suspect that you're slamming a high volume of requests at a load balancer without warming it. You have two options:

  • Pre-warm the load balancer by contacting AWS and telling them the load you expect, so it can scale ahead of time
  • Increase your load gradually so the load balancer has a chance to scale. This can take longer than you expect, so I suggest you initially ramp up to your high request rates over at least a few hours (one way to express such a ramp in Locust is sketched below).
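
For example, with current Locust versions you can express a gradual ramp with a custom LoadTestShape instead of spawning everything at once (a sketch; the step sizes and durations are illustrative, not a recommendation for your exact numbers):

from locust import LoadTestShape

class GradualRamp(LoadTestShape):
    # Step up from zero towards the target user count over several hours.
    # All numbers here are illustrative; tune them to your own target load.
    step_users = 2500      # users added at each step
    step_duration = 1800   # seconds per step (30 minutes)
    target_users = 25000

    def tick(self):
        run_time = self.get_run_time()
        current_step = int(run_time // self.step_duration) + 1
        users = min(current_step * self.step_users, self.target_users)
        # run one extra step at full load, then stop the test
        total_steps = self.target_users // self.step_users + 1
        if run_time > total_steps * self.step_duration:
            return None
        return (users, 100)  # (user_count, spawn_rate)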

ELB Background

Initially a load balancer may be a single small virtual server in each AZ distributing the load. As your load increases, AWS adds more or larger servers behind the scenes to take your load, which is why the balancer's DNS can change regularly. If you throw a huge load at these small servers they will fail, likely prioritizing traffic as they do, and a GET request is typically treated as more important than a HEAD request.
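
You can actually watch this happen by resolving the balancer's DNS name periodically; as it scales, the set of addresses behind the name changes. A rough sketch (the hostname is a placeholder for your ELB's DNS name):

import socket
import time

elb_hostname = "my-elb-123456789.eu-west-1.elb.amazonaws.com"  # placeholder

while True:
    _, _, addresses = socket.gethostbyname_ex(elb_hostname)
    print(time.strftime("%H:%M:%S"), sorted(addresses))
    time.sleep(60)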

Links

There is a relevant thread on the AWS forums, which seems to support my theory.

You should also look at the AWS network stress-testing page to ensure you're not effectively creating a DDoS.

Tim