1

I've been successfully running a gRPC service on GCP Cloud Run for over a year. Suddenly, it stopped working and responds to each request with...

StatusCode="Unavailable", Detail="upstream connect error or disconnect/reset before headers. reset reason: remote reset"

There was no new revision or deployment. It just started responding this way out of the blue. There is no proxy, no VPC, no gateway, no ingress controller, I'm just using the URL provided by Cloud Run with port 443 specified. It's the simplest deploy possible.

I've tried disabling end to end HTTP/2 (which worked previously), create a brand new service instance with a new name, change runtime environment generations, all haven't moved me any closer to a resolution. I have not migrated to using ESPv2, so this should not be a concern either.

What could possibly be causing this?

Jyothi Kiranmayi
  • 2,090
  • 5
  • 14
gfish3000
  • 1,557
  • 1
  • 11
  • 22
  • What is going on inside your container? If you manually start an instance of it in a Compute Engine and then try and invoke it ... what happens? – Kolban Feb 22 '22 at 03:08
  • According to the container logs, everything is fine. It does its initial database fetch and waits for a call. When hit, it registers the endpoint being called, then... nothing. – gfish3000 Feb 22 '22 at 04:40
  • Since Cloud Run basically manages a Docker container on your behalf, what about running that container "outside" of Cloud Run? Create a Compute Engine and specify the image in Artifact registry to run. Then you will be able to hit an instance of the container directly and also "shell exec" into the container to do some debugging. – Kolban Feb 22 '22 at 04:54
  • This exact container worked perfectly fine for a week until one night it just suddenly stopped. If the problem was with the container, it should've been happening either as soon as it deployed, or very shortly after. – gfish3000 Feb 22 '22 at 06:54
  • Hi, I have the very same issue, I posted [here](https://stackoverflow.com/questions/71207574/grpc-on-google-cloud-run-upstream-connect-error-or-disconnect-reset-before-hea?noredirect=1#comment125899400_71207574) on stackoverflow and I opened a Github issue [here](https://github.com/GoogleCloudPlatform/dotnet-docs-samples/issues/1644) – Vince.Bdn Feb 22 '22 at 18:49
  • I definitely think it's due to some internal change on HTTP/2 (which I'm forced to use because I have a Kestrel web-server which needs it for GRPC)... The thing is, there are many of us who have seen their Cloud Run instance stop working lately... :'( – Vince.Bdn Feb 22 '22 at 18:51
  • 1
    Here is a link to the [issue tracker](https://issuetracker.google.com/issues/220724779) `https://issuetracker.google.com/issues/220724779` – Craig Feb 22 '22 at 20:05

5 Answers5

1

All right, after a lot of investigation and with some vert helpful links from the community, turns out there was a breaking error under the hood of GCP Cloud Run for Envoy. Some headers were removed, which led to some gRPC calls not being routed correctly. Since I'm using .NET 6, I was able to use the following addition to my Kestrel config...

webBuilder.ConfigureKestrel(options =>
{
    ...
    options.AllowAlternateSchemes = true;
});

... to solve the issue, as per Google's recommendation. They are supposed to fix the underlying error on the day this is being posted. Thank you all.

gfish3000
  • 1,557
  • 1
  • 11
  • 22
0

I've previously had this exact same issue you're describing - spent hours debugging/redeploying etc. I verified the GRPC server was returning successfully, and went down a rabbit-hole of .net core's handling of http2 cleartext and tls negotiation/downgrading (since cloudrun terminates TLS and .net core GRPC seems to hate unencrypted HTTP2 payloads) - this led me mostly nowhere and fixed nothing.

At the end of it all, I came back the next day - redeployed some old revisions (that were previously broken) and it all worked.

My assumption is something is going on the cloudrun side of things... but not sure.

(Obviously not a good answer, but don't have the reputation to comment).

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 22 '22 at 17:21
  • @Michael : was that last summer when you "came back the next day and it was working"? Last year I had added some Rest API endpoints to allow development to continue during that HTTP2 issue on Cloud Run. I took them out about 5 weeks ago :/ I am redoing that "HTTP1.1 Web API fail-over" work again, and adding newer endpoints for the new work that has happened. – Craig Feb 23 '22 at 18:01
  • Yup, I think it was around then. – Michael Lentin Feb 24 '22 at 03:25
0

When a new connection is opened, Cloud Run checks if it has content-type=application/grpc h2 header in the first request. Once Cloud Run matches gRPC, it sends all the connections to the gRPC server.

Sometimes, disabling the Cloud Run http/2 end-to-end (which you mentioned you already tried) resolves the issue as suggested in the ServerFault case.

The error which you are receiving mostly seems to indicate that ESPv2 can't reach the service's backend.

As a workaround, I suggest you the below ways to mitigate the error.

  1. Configure ESPv2 with the correct backend address as suggested here.

  2. In case if ESPv2 is already configured, force it to use IPv4 addresses via the --backend_dns_lookup_family_flag. You can check more details under the “DNS lookup” section in this documentation.

  3. Configure your requests to be gRPC requests.

Also, have a look at this GitHub Link.

Mousumi Roy
  • 609
  • 1
  • 6
  • ESPv2 is not enabled for this instance, and I checked that the service and container has a content-type=application/grpc h2 header. I've seen the link and while it looks like the exact same issue, the only link out is private so it yields absolutely nothing actionable. – gfish3000 Feb 22 '22 at 18:38
0

#NotAnAnswer

Here is a link to the incident details

https://status.cloud.google.com/incidents/qfgJm8m4WPn2Ej2Z7vc2

It started on Feb 10th!

Craig
  • 63
  • 7
  • If you can use .NET 6 there is a workaround posted in that Incident Report: https://learn.microsoft.com/en-us/dotnet/api/microsoft.aspnetcore.server.kestrel.core.kestrelserveroptions.allowalternateschemes?view=aspnetcore-6.0 – Craig Feb 23 '22 at 20:54
0

I don't know for sure if you were having the same issue as I have but this error tend to happen since 04/15/2021 (or so) on a .NET GRPC or HTTP/2 server.

Here is the answer I wrote on my own question on stackoverflow: https://stackoverflow.com/a/71249250/4829094. You'll find there support for Google Cloud Platforms answers and fix roadmap.

Vince.Bdn
  • 1,145
  • 1
  • 13
  • 28
  • I was able to use `options.AllowAlternateSchemes = true;` in my Kestrel config, and the service is now working. Thank you. – gfish3000 Feb 25 '22 at 00:48