
I'm using EKS to deploy a service, with an Ingress running on top of alb-ingress-controller.

All in all, I have about 10 replicas of a single pod, with a single Service of type NodePort that forwards traffic to them. The replicas run on 10 nodes, provisioned with eksctl and spread across 3 availability zones.
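
For context, the Service looks roughly like this (a minimal sketch; the name, selector, and container port are placeholders, but the nodePort matches the target port visible in the ALB access logs below):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-service          # placeholder name
    spec:
      type: NodePort
      selector:
        app: my-app             # placeholder label
      ports:
        - port: 80
          targetPort: 8080      # placeholder container port
          nodePort: 30246       # the port the ALB targets, per the access logs below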

The problem I'm seeing is very strange: inside the cluster, all the logs show that requests are handled in under 1s, mostly in the 20-50 ms range. I know this because I used linkerd to look at the request latency percentiles, as well as the app logs themselves. The ALB logs/monitoring, however, tell a very different story: I see relatively high request latency (often approaching 20s or more), and frequent 504 errors returned by the ALB (sometimes 2-3 every 5 minutes).
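
For reference, the in-cluster percentiles come from something like the following (the deployment name and namespace are placeholders):

    # P50/P95/P99 latencies and success rate as seen by the linkerd proxies
    linkerd stat deploy/my-app -n my-namespace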

When trying to read the access logs for the ALB, I noticed that the 504 lines look like this:

https 2019-12-10T14:56:54.514487Z app/1d061b91-XXXXX-3e15/19297d73543adb87 52.207.101.56:41266 192.168.32.189:30246 0.000 -1 -1 504 - 747 308 "GET XXXXXXXX" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-1:750977848747:targetgroup/1d061b91-358e2837024de757a3d/e59bbbdb58407de3 "Root=1-5defb1fa-cbcdd248dd043b5bf1221ad8" "XXXX" "XXXX" 1 2019-12-10T14:55:54.514000Z "forward" "-" "-" "192.168.32.189:30246" "-"

Where the request processing time is 0 and the target processing time is -1, indicating the request never made it to the backend, and response was returned immediately.
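
Reading that line against the documented ALB access log entry format, the relevant fields appear to be:

    https                          type
    2019-12-10T14:56:54.514487Z    time (when the ALB generated the response)
    ...
    0.000                          request_processing_time
    -1                             target_processing_time
    -1                             response_processing_time
    504                            elb_status_code
    -                              target_status_code
    ...
    2019-12-10T14:55:54.514000Z    request_creation_time (when the ALB received the request)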

I tried playing with the backend HTTP keep-alive timeout (currently 75s) and with the ALB idle timeout (currently 60s), but neither seems to change this behavior much.
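
For completeness, the ALB idle timeout is set through the alb-ingress-controller's load-balancer-attributes annotation, roughly like this (a minimal sketch; the Ingress name and backend are placeholders):

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: my-ingress                   # placeholder name
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/scheme: internet-facing
        # ALB idle timeout, in seconds (currently 60 in my setup)
        alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=60
    spec:
      rules:
        - http:
            paths:
              - backend:
                  serviceName: my-service    # the NodePort Service above
                  servicePort: 80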

If anyone can point me to how to proceed and investigate this, or what the cause can be, I'd appreciate it very much.

vmalloc
    *"response was returned immediately."* Check that again. There's a ~1 minute delay between the two timestamps. – Michael - sqlbot Dec 16 '19 at 01:54
  • You are correct. There is a minute's difference - sorry I missed that. However, it seems this corresponds to the ALB's idle timeout. No matter how high I set it, those intermittent 504s follow the same timeout. When I set the idle timeout to 5 minutes, the difference in the access logs becomes 5 minutes... So it seems the ALB never reaches the targets, no matter how much time passes... – vmalloc Dec 17 '19 at 06:20

1 Answer


We faced a similar type of issue with the EKS and ALB combination. If the target processing time shows -1, there is a chance that the request waiting queue is full on the target side, in which case the ALB immediately drops the request.
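
One quick way to check whether the ALB is failing to open connections to the targets is the TargetConnectionErrorCount CloudWatch metric (a sketch; substitute your own load balancer dimension and time window - the dimension value is the elb field from your access logs):

    aws cloudwatch get-metric-statistics \
      --namespace AWS/ApplicationELB \
      --metric-name TargetConnectionErrorCount \
      --dimensions Name=LoadBalancer,Value=app/1d061b91-XXXXX-3e15/19297d73543adb87 \
      --start-time 2019-12-10T14:00:00Z \
      --end-time 2019-12-10T15:00:00Z \
      --period 300 \
      --statistics Sum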

Try running an ab benchmark that skips the ALB and sends requests directly to the Service or to a private IP address. Doing this will help you identify where the problem lies.
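
A minimal sketch of that comparison (hostname and paths are placeholders; the node IP and NodePort are taken from the access log line in the question and are only reachable from inside the VPC):

    # 1) Through the ALB (public DNS name is a placeholder)
    ab -n 1000 -c 10 https://my-service.example.com/

    # 2) Bypassing the ALB: hit the NodePort on a node directly
    #    (run from a host inside the VPC, e.g. a bastion or another node)
    ab -n 1000 -c 10 http://192.168.32.189:30246/

    # 3) Or bypass node networking via a port-forward to the Service
    kubectl port-forward svc/my-service 8080:80 &
    ab -n 1000 -c 10 http://127.0.0.1:8080/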

For us, 1 out of 10 requests failed when we sent traffic via the ALB. We saw no failures when we sent requests directly to the service.

The AWS recommendation is to use an NLB over the ALB for this. An NLB has advantages here and is better suited to Kubernetes. There is a blog post that explains this: Using a Network Load Balancer with the NGINX Ingress Controller on Amazon EKS.
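
The switch itself is a Service annotation on the ingress controller, roughly like this (a sketch assuming the NGINX ingress controller from that blog post; the names, namespace, and selector labels are the usual defaults, so adjust to yours):

    apiVersion: v1
    kind: Service
    metadata:
      name: ingress-nginx
      namespace: ingress-nginx
      annotations:
        # Ask the in-tree AWS cloud provider for an NLB instead of a Classic ELB
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
    spec:
      type: LoadBalancer
      selector:
        app.kubernetes.io/name: ingress-nginx
      ports:
        - name: http
          port: 80
          targetPort: 80
        - name: https
          port: 443
          targetPort: 443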

We changed to an NLB and are no longer getting 5XX errors.

Sriram G