504 Gateway Timeout error on AWS in a specific region

Question

Current environment:

The Node.js API server is hosted on an EC2 instance (Ubuntu 20.04) behind a load balancer and a security group, and it is served over HTTPS. The front end is hosted on S3 and served through CloudFront.

Both use Route 53 as their DNS provider, and everything works well.

Problem:

Everything works fine except in one specific region: South Windsor, CT, US (the internet provider is Cox Cable). There, API requests frequently return a 504 (Gateway Timeout) error for no apparent reason. The UI works well; only API requests are affected. Requests from other regions, e.g. Mexico and Russia, work.

I've tried many things on the load balancer, but there are no 504 errors in the load balancer logs (I checked in CloudWatch). This suggests the requests never arrived. Could it be a Route 53 bug? The only thing configured there is a CNAME record. And why would this happen only in a specific region?

Any experiences are welcome!

Also, do you have CloudFront logging enabled? You should be able to get more metadata about the requests. — Chris Williams, May 25 '20 at 14:19
Actually, the problem is happening on the API server and it's not related to CloudFront. It's EC2 + load balancer + Route 53. — Liu Zhang, May 25 '20 at 14:21
OK, so can you enable the logs on the ELB? A 504 timeout would normally mean your ELB timed out connecting to the target group. — Chris Williams, May 25 '20 at 14:23
Actually, the ELB never logs an error; it seems the request never reaches the ELB when a 504 occurs. — Liu Zhang, May 25 '20 at 14:34
Can you reproduce it from the same location? The next step would be to enable VPC flow logs, isolate them to the subnets where your load balancer resides, and then look for rejected traffic. — Chris Williams, May 25 '20 at 14:35
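A minimal sketch of the "look for rejected traffic" step, assuming the flow logs use the default version 2 record format (14 space-separated fields: `version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status`) and have already been downloaded as text lines. `parseFlowRecord` and `rejectedTo` are hypothetical helper names, not part of any AWS SDK.

```typescript
// Subset of a default-format (version 2) VPC flow log record.
interface FlowRecord {
  srcAddr: string;
  dstAddr: string;
  dstPort: number;
  action: string; // "ACCEPT" or "REJECT"
}

// Parse one record; returns null for the header line, malformed lines,
// and NODATA/SKIPDATA records (whose address and port fields are "-").
function parseFlowRecord(line: string): FlowRecord | null {
  const f = line.trim().split(/\s+/);
  if (f.length !== 14 || f[0] === "version") return null;
  const dstPort = Number(f[6]);
  if (!Number.isInteger(dstPort)) return null;
  return { srcAddr: f[3], dstAddr: f[4], dstPort, action: f[12] };
}

// Keep only rejected traffic headed to the given port (443 for HTTPS).
function rejectedTo(lines: string[], port: number): FlowRecord[] {
  return lines
    .map(parseFlowRecord)
    .filter((r): r is FlowRecord => r !== null && r.action === "REJECT" && r.dstPort === port);
}
```

Filtering the result by the affected users' public IP addresses would show whether their packets reach the load balancer's network interfaces at all and are being dropped there.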
We had a similar error, and in our case the cause was a misconfigured NACL / security group on the load balancer: one particular load balancer subnet was not publicly accessible. The error occurred only occasionally, because modern clients were smart enough to recognize that the other endpoints (from the multivalue DNS record) were working and fell back to those IPs. — Martin Löper, May 29 '20 at 14:38
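That theory can be tested from the affected network by resolving every address behind the load balancer's DNS name and probing each one separately, instead of letting the client choose. A minimal sketch, assuming Node.js 18+ with TypeScript; `canConnect` and `probeAllAddresses` are hypothetical helpers, and the probe checks only TCP reachability on the port, not TLS or HTTP.

```typescript
import { resolve4 } from "node:dns/promises";
import * as net from "node:net";

// Resolve true if a TCP connection to ip:port opens within timeoutMs.
function canConnect(ip: string, port: number, timeoutMs: number): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ host: ip, port });
    const done = (ok: boolean) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs, () => done(false)); // no response in time
    socket.once("connect", () => done(true));
    socket.once("error", () => done(false)); // refused, unreachable, etc.
  });
}

// Probe each IPv4 address behind the hostname on its own, so a single
// unreachable load balancer node shows up even if clients usually fall back.
async function probeAllAddresses(
  hostname: string,
  port = 443,
  timeoutMs = 5000,
): Promise<Map<string, boolean>> {
  const results = new Map<string, boolean>();
  for (const ip of await resolve4(hostname)) {
    results.set(ip, await canConnect(ip, port, timeoutMs));
  }
  return results;
}
```

For example, call `probeAllAddresses("my-alb-123456789.us-east-1.elb.amazonaws.com")` (a placeholder name) from a machine on the affected ISP. An address that consistently returns `false` while the others return `true` points at a per-subnet problem such as a NACL or route table. Load balancer IPs can change over time, so repeat the check a few times.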

score 1 · Answer 1 · answered Jun 03 '20 at 13:19

Cause 1: The application takes longer to respond than the configured inactivity timeout.

Solution 1: Monitor the HTTPCode_ELB_5XX and Latency metrics (for an Application Load Balancer, HTTPCode_ELB_5XX_Count and TargetResponseTime). If these metrics increase, the application may not be responding within the idle timeout period. For details on requests that exceed this limit, enable access logs on the load balancer and review the 504 response codes in the logs generated by Elastic Load Balancing. If necessary, increase capacity or increase the configured idle timeout.
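Reviewing the access logs for 504s can be scripted. A minimal sketch, assuming an Application Load Balancer, whose log lines begin with 12 space-free fields (`type time elb client:port target:port request_processing_time target_processing_time response_processing_time elb_status_code target_status_code received_bytes sent_bytes`). Classic Load Balancer logs have no leading `type` field, so every index below would shift down by one. `parseAlbLine` and `elb504s` are hypothetical names.

```typescript
// Subset of an ALB access log entry.
interface AlbLogEntry {
  time: string;
  client: string;               // client ip:port
  target: string;               // target ip:port, or "-" if none was used
  targetProcessingTime: number; // -1 in several failure cases, e.g. no target response
  elbStatus: string;
  targetStatus: string;         // "-" if the target sent no response
}

// Only the first 12 fields are read; the quoted request field comes after them.
function parseAlbLine(line: string): AlbLogEntry | null {
  const f = line.trim().split(/\s+/);
  if (f.length < 12) return null;
  return {
    time: f[1],
    client: f[3],
    target: f[4],
    targetProcessingTime: Number(f[6]),
    elbStatus: f[8],
    targetStatus: f[9],
  };
}

// Entries where the load balancer itself returned 504.
function elb504s(lines: string[]): AlbLogEntry[] {
  return lines
    .map(parseAlbLine)
    .filter((e): e is AlbLogEntry => e !== null && e.elbStatus === "504");
}
```

If the affected clients' addresses never appear in these entries while their browsers report 504s, that is consistent with the question's observation that the requests never reach the load balancer.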

Cause 2: Registered instances are closing the connection to Elastic Load Balancing.

Solution 2: Enable keep-alive on the EC2 instances and verify that the keep-alive timeout is longer than the load balancer's idle timeout.
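For a Node.js API server this is controlled by `server.keepAliveTimeout`, which defaults to 5 seconds, well below the 60-second default idle timeout of an AWS load balancer. A minimal sketch using the built-in `http` module; `createApiServer` is a hypothetical helper, and `lbIdleTimeoutMs` is an assumption you should set to the idle timeout actually configured on your load balancer.

```typescript
import * as http from "node:http";

// Build an HTTP server whose keep-alive settings outlast the load balancer's.
function createApiServer(lbIdleTimeoutMs: number): http.Server {
  const server = http.createServer((_req, res) => {
    // Placeholder handler; a real app would pass its Express/Koa app here.
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ ok: true }));
  });

  // Keep idle keep-alive sockets open longer than the load balancer does,
  // so the load balancer, not Node, closes idle connections.
  server.keepAliveTimeout = lbIdleTimeoutMs + 5_000;
  // Commonly recommended alongside this: keep headersTimeout above
  // keepAliveTimeout so it cannot fire first on a reused connection.
  server.headersTimeout = server.keepAliveTimeout + 1_000;
  return server;
}
```

Call `createApiServer(60_000).listen(3000)` to start it. With Express, `app.listen()` returns an `http.Server`, so the same two properties can be set on its return value instead.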

Notes:

Check the configuration of your firewall, security groups, and origin server to identify the source of the errors.
If you're receiving HTTP 504 errors from CloudFront but can connect directly to the origin, consider increasing the distribution's origin response timeout. By default, CloudFront waits 30 seconds for a response from the origin. If your application needs more than 30 seconds to process and return a response, CloudFront returns an HTTP 504 error.
