Google Cloud Load Balancer suddenly throwing 504 timeouts (despite no change in architecture)

Question

Problem

Sometime overnight, my service began throwing 504 errors on longer-running (30+ second) requests, despite no recent changes in architecture.

Setup

GCP Cloud Run configured with 3600 second timeout
GCP Load Balancer with serverless NEG pointing to Cloud Run

Troubleshooting

If I hit the Cloud Run-generated URL directly, I can successfully execute 30+ second requests. If I instead hit the public URL (and thus the load balancer), I get timeouts on longer requests.
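The comparison above can be made repeatable with a small script. This is only a sketch, assuming the slow endpoint answers a plain GET; the function names `time_request` and `compare` are made up for illustration, and the `opener` parameter exists only so the HTTP call can be swapped out when testing.

```python
import time
import urllib.error
import urllib.request


def time_request(url, timeout=120.0, opener=urllib.request.urlopen):
    """Return (HTTP status, elapsed seconds) for a single GET to `url`."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # wait for the full body, not just the headers
            status = response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including a 504 from the load balancer) land here.
        # Client-side timeouts and connection errors are not caught and propagate.
        status = err.code
    return status, time.monotonic() - start


def compare(direct_url, public_url, timeout=120.0, opener=urllib.request.urlopen):
    """Time the same slow request via both routes and return both results."""
    return {
        "direct": time_request(direct_url, timeout, opener),
        "public": time_request(public_url, timeout, opener),
    }
```

Calling `compare()` with the Cloud Run-generated URL and the public URL returns a status and duration for each route. A 504 on the public route at roughly the same elapsed time on every run, while the direct route returns 200, would point at a timeout applied in front of Cloud Run rather than in the service itself.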

Musings

Looking at Google Cloud release notes, there was a change to Load Balancing, but nothing related to timeouts: https://cloud.google.com/release-notes

The current setup has been working flawlessly for over a year.

Update

Modifying the backend timeout is disabled for serverless NEGs (see screenshot).

Update 2

This looks like a bug in GCP load balancing introduced by the last update: per the documentation, the default timeout should be 60 minutes, not 30 seconds (see screenshot).

Hey, we are having the same issue as you. Interestingly, we have a (classic) LB configured on staging and it seems to work fine. We even tried deploying a new load balancer on staging with the same config as on prod, and it works fine. So we are still looking for a solution. At the moment we are planning to redeploy the LB on prod overnight. — Baki, Apr 17 '23 at 09:24
There is also an issue tracker entry open at: https://issuetracker.google.com/issues/278146890 — Baki, Apr 17 '23 at 09:35
@Baki - Hah, yes, I opened that issue after suspecting a bug (see "Update 2") — Rusty Moorman, Apr 17 '23 at 11:15

score 2 · Accepted Answer · answered Apr 17 '23 at 10:38

Since the issue affected our prod environment, we created a new load balancer, this time choosing the global (classic) LB option, got a new IP address, and swapped it in for the old one at our DNS provider. After that, everything works. We will probably delete the previous LB. I am aware this is a workaround rather than a fix, but hey, production is working and exporting data like before.

score 0 · Answer 2 · answered Apr 19 '23 at 14:37

In addition to @Baki's load balancer redeployment workaround, I would like to confirm that the default timeout for the serverless NEG backend service is 3600 seconds (1 hour), as stated in the documentation, despite the console only showing 60 seconds. You may also override the timeout using route action timeouts. This appears to be a UI issue at the moment. Other users experiencing the same issue may find it helpful to check the issue tracker entry submitted by @RustyMoorman and follow its updates by clicking the +1 and star buttons.
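As a rough illustration of the route action timeout override, the sketch below adds a timeout to a URL map definition held as a Python dict. The field names (`defaultRouteAction`, `timeout`, `seconds`) follow my reading of the Compute Engine URL map resource and should be checked against the current API reference. Whether a given load balancer type honors the setting is also an assumption to verify. The helper `with_route_timeout`, the URL map name, and the backend service path are all hypothetical.

```python
import copy
import json


def with_route_timeout(url_map, seconds):
    """Return a copy of `url_map` with defaultRouteAction.timeout set."""
    updated = copy.deepcopy(url_map)  # leave the caller's dict untouched
    action = updated.setdefault("defaultRouteAction", {})
    # Assumption: durations are objects with an int64 "seconds" field,
    # which JSON APIs conventionally encode as a string.
    action["timeout"] = {"seconds": str(seconds)}
    return updated


# Hypothetical URL map; replace the name and backend service with your own.
example = {
    "name": "my-url-map",
    "defaultService": "projects/my-project/global/backendServices/my-backend",
}

print(json.dumps(with_route_timeout(example, 3600), indent=2))
```

The printed structure is what would be merged into an exported URL map definition before re-importing it. Treat it as a starting point rather than a verified fix for the 504s described in the question.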