I'm using EKS to deploy a service, with ingress running on top of alb-ingress-controller.
All in all I have about 10 replicas of a single pod, with a single service of type NodePort
which forwards traffic to them. The replicas run on 10 nodes, established with eksctl, and spread across 3 availability zones.
The problem I'm seeing is very strange - inside the cluster, all the logs show that requests are being handled in less than 1s, mostly around 20-50 millis. I know this because I used linkerd to show the percentiles of request latencies, as well as the app logs themselves. However, the ALB logs/monitoring tell a very different story. I see a relatively high request latency (often approaching 20s or more), and often also 504 errors returned from the ELB (sometimes 2-3 every 5 minutes).
When trying to read the access logs for the ALB, I noticed that the 504 lines look like this:
https 2019-12-10T14:56:54.514487Z app/1d061b91-XXXXX-3e15/19297d73543adb87 52.207.101.56:41266 192.168.32.189:30246 0.000 -1 -1 504 - 747 308 "GET XXXXXXXX" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-1:750977848747:targetgroup/1d061b91-358e2837024de757a3d/e59bbbdb58407de3 "Root=1-5defb1fa-cbcdd248dd043b5bf1221ad8" "XXXX" "XXXX" 1 2019-12-10T14:55:54.514000Z "forward" "-" "-" "192.168.32.189:30246" "-"
Where the request processing time is 0 and the target processing time is -1, indicating the request never made it to the backend, and response was returned immediately.
I tried to play with the backend HTTP keepalive timeout (currently at 75s) and with the ALB idle time (currently at 60s) but nothing seems to change much for this behavior.
If anyone can point me to how to proceed and investigate this, or what the cause can be, I'd appreciate it very much.