2

EDIT

It seems that my first error I describe is very easy to reproduce. Actually, Google Run fails to run any GRPC query on a .NET5 GRPC server it seems (at least, it did work before but as of today, February 21st, it seems that something changed). To reproduce:

  1. Create a .NET5 GRPC server (also fails with .NET6):
dotnet new grpc -o TestGrpc
  1. Change Program.cs so that it listens on $PORT, typically:
        public static IHostBuilder CreateHostBuilder(string[] args)
        {
            var port = Environment.GetEnvironmentVariable("PORT") ?? "8080";
            var url = string.Concat("http://0.0.0.0:", port);
            return Host.CreateDefaultBuilder(args)
                .ConfigureWebHostDefaults(webBuilder =>
                {
                    webBuilder.UseStartup<Startup>().UseUrls(url);
                });
        }
  1. A very simple Dockerfile to have an image for the server (fails also with a more standard one, like here):
FROM mcr.microsoft.com/dotnet/sdk:5.0

COPY . ./

RUN dotnet restore ./TestGrpc.csproj
RUN dotnet build ./TestGrpc.csproj -c Release
CMD dotnet run --project ./TestGrpc.csproj
  1. Build and push to Google Artifcats Registry.
  2. Create a Cloud Run instance with HTTP/2 enabled (Ketrel requires HTTP/2 so we need to set HTTP/2 end-to-end, yet I tested without as well but it's not better).
  3. Use Grpcurl for instance and try:
grpcurl {CLOUD_RUN_URL}:443 list

And you will obtain the same error as I got with my (more complex) project:

Failed to list services: rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: remote reset

On the Google Cloud Run instance I only have the log:

2022-02-21T16:44:32.528530Z POST 200 1.02 KB 41 ms grpcurl/v1.8.6 grpc-go/1.44.1-dev https://***/grpc.reflection.v1alpha.ServerReflection/ServerReflectionInfo

(I don't really understand why it's a 200 though... and never seems to reach the actual server implementation, just as if there was some kind of middleware blocking the query to reach the implementation... )

I'm pretty sure this used to worked as I started my project this way (and then changed the protos, the service, etc. ). If anyone has a clue I'd be more than grateful :-)


INITIAL POST (less precise than explanations above but I leave it here if it may give clues)

I have a server running within a Docker (.NET5 GRPC application). This server, when deployed locally works perfectly fine. But recently I have an error when I deploy it on Google Cloud Run: upstream connect error or disconnect/reset before headers. reset reason: remote reset when it was working fine before. I keep on having this error from any client I use, for instance with Curl:

curl -v https://{ENDPOINT}/{Proto-base}/{Method} --http2


*   Trying ***...
* TCP_NODELAY set
* Connected to *** (***) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.a.run.app
*  start date: Feb  7 02:07:06 2022 GMT
*  expire date: May  2 02:07:05 2022 GMT
*  subjectAltName: host "***" matched cert's "*.a.run.app"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1C3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x5564aad30860)
> GET /{Proto}/{Method} HTTP/2
> Host: ***
> user-agent: curl/7.68.0
> accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 503 
< content-length: 85
< content-type: text/plain
< date: Mon, 21 Feb 2022 13:51:31 GMT
< server: Google Frontend
< traceparent: 00-5a74487dafb5687961deeb17e0158ca9-5ab63cd23680e7d7-01
< x-cloud-trace-context: 5a74487dafb5687961deeb17e0158ca9/6536478782730069975;o=1
< alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
<
* Connection #0 to host *** left intact
upstream connect error or disconnect/reset before headers. reset reason: remote reset

Same happens with Grpcurl:

grpcurl ***:443 list {Proto-base}

Failed to list methods for service "***.Company": rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: remote reset

I cannot find much resource on this error as most of the threads I read deal with another type of reset reason (like protocol, or connection, etc.). But I strictly have no idea what remote reset means and what I did wrong.

Looking at the logs in Google Cloud Run, I can see that the server is definitly hit, though I added trace logging in the route which is not triggered, hence it never reaches my code:

2022-02-21T14:44:22.840580Z  POST 200 1.01 KB 1 msgrpc-python/1.44.0 grpc-c/22.0.0 (linux; chttp2) https://***/{Protos-base}/{Method}

(if I reached my code it should print some "Hellos" everywhere which it doesn't)

Has anyone ever found this?

P.S.: there are many things around about Envoy, but I don't even use this. I simply have a Cloud Run instance (with HTTP/2 - and I tried without but it fails due to protocol issue).

Vince.Bdn
  • 1,145
  • 1
  • 13
  • 28
  • 'when it was working fine before' do you mean your server in cloud worked well before? – Lei Yang Feb 21 '22 at 15:44
  • @LeiYang Yes exactly. It's still a sandbox project though so it's quite hard to track when it started failing properly. I can't think of any configuration change on GCP I did and as for the .NET code, I merely added a few routes to the service. – Vince.Bdn Feb 21 '22 at 16:01
  • To add some context: I started again from [the official doc](https://github.com/GoogleCloudPlatform/dotnet-docs-samples/tree/main/run/helloworld) and it works fine with HTTP/1.1 standard Kestrel web-app. A simple change with `builder.Services.AddGrpc()` (which forces Kestrel to HTTP/2) breaks everything! If we don't set end-to-end HTTP/2 then we have a `protocol error` (normal Kestrel needs HTTP/2 for GRPC), if we do we have a `remote reset` as above... – Vince.Bdn Feb 22 '22 at 14:44
  • if i understand correctly, [GoogleCloudPlatform/dotnet-docs-samples](https://github.com/GoogleCloudPlatform/dotnet-docs-samples) are samples to call google **api**, but your code above is just a console application possibly **hosted** in google cloud. they are totally different things. – Lei Yang Feb 22 '22 at 15:00
  • @LeiYang they are samples of serveral things ;) But in particular here: https://github.com/GoogleCloudPlatform/dotnet-docs-samples/tree/main/run/helloworld it is a sample of a Kestrel web-server, dockerized and that can be pushed on Artifacts Registry to be served on Google Cloud Run. – Vince.Bdn Feb 22 '22 at 15:06
  • this sample is an asp.net core application, not grpc related. and i suppose google cloud hosting is something like kubernetes, and there are a lot of networking configurations not shown here. – Lei Yang Feb 22 '22 at 15:09
  • @LeiYang Yes agreed, it's a mere Kestrel HTTP/1.1 web server. Yet, the documentation says it supports GRPC (https://cloud.google.com/run/docs/triggering/grpc), which typically requires nothing more than HTTP/2 (for Kestrel). That's why I made sure first that the .NET HTTP/1.1 web server works as expected, then make it an HTTP/2 with GRPC handler and it does fail (when it used to work). – Vince.Bdn Feb 22 '22 at 15:13
  • @LeiYang I agree with you, there's probably a bunch of network we're hidden from. That's typically I think where something might have changed and broken the expected behavior. – Vince.Bdn Feb 22 '22 at 15:15
  • if i were you, i'd follow tutorials of [microsoft](https://learn.microsoft.com/en-us/aspnet/core/grpc/basics?view=aspnetcore-6.0) and [grpc official](https://grpc.io/docs/languages/csharp/quickstart/). – Lei Yang Feb 22 '22 at 15:18
  • @LeiYang Yup sure, that's what I summarized above (with `dotnet new grpc -o TestGrpc` it's exactly the tutos of Microsoft). Everything works as expected locally, it's just when I deploy it on Google Cloud Run that it fails... – Vince.Bdn Feb 22 '22 at 15:24
  • so you have to read google cloud hosting docs. not api docs. can you paste how you use `gcloud` command? – Lei Yang Feb 22 '22 at 15:34
  • My gRPC Cloud Run service stopped working over the weekend too. My Powershell `gcloud` deployment cmd is similar to this: ``` gcloud beta run deploy --project=$ProjectName --platform=managed --port=80 --use-http2 --allow-unauthenticated --min-instances=$MinimumInstances --max-instances=$MaximumInstances --cpu=$CPUs --memory=$Memory --region=$GoogleRegion $ServiceName --image gcr.io/$ProjectName/${BaseImageName}:$BUILD_VERSION-$BUILD_VERSION_SUFFIX --service-account=$ServiceAccount --timeout=$RequestTimeout ``` @LeiYang not sure if that is what you are looking for. – Craig Feb 22 '22 at 17:27
  • @LeiYang @Craig I'm using kind of the same command. I think it stopped working in the weekend for me and the more I investigate the more I think it's due to `--use-http2` (end-to-end HTTP/2). I opened a github issue [here](https://github.com/GoogleCloudPlatform/dotnet-docs-samples/issues/1644), and hopefully they can provide a workaround or a quick fix. – Vince.Bdn Feb 22 '22 at 17:53
  • @Craig not sure it's really comforting but there are quite many people that got impacted all of a sudden this weekend, e.g. https://stackoverflow.com/questions/71215192/unable-to-connect-to-basic-google-cloud-run-service-upstream-connect-error-or-d#71219829 – Vince.Bdn Feb 22 '22 at 19:43
  • @Vince.Bdn Thanks, you're right - not comforting, but at least "it's not me" :) Is it a subtle hint from Google to switch over to Kubernetes Engine - so I can pay for 3 server instances all the time? – Craig Feb 22 '22 at 20:02
  • 1
    Oh, I forgot to mention: I found an 'issue' in their Issue Tracker. Go and Star it. Maybe we can get some attention. https://issuetracker.google.com/issues/220724779 – Craig Feb 22 '22 at 20:04
  • @Vince.Bdn There is a work-around if you can use .NET 6 posed in the Incident tracker report : https://status.cloud.google.com/incidents/qfgJm8m4WPn2Ej2Z7vc2 .NET workaround link (from above) https://learn.microsoft.com/en-us/dotnet/api/microsoft.aspnetcore.server.kestrel.core.kestrelserveroptions.allowalternateschemes?view=aspnetcore-6.0 – Craig Feb 23 '22 at 20:54

1 Answers1

1

It is an actual bug from Envoy and Google Cloud Run. There is a quick fix if you're using .NET6, otherwise it's a bit more hacky. I will just copy here the answer provided by Amanda Tarafa Mas from Google Cloud Platform on the github issue I opened:

Here are the potential fixes:

  • When using .NET 6 you can set KestrelServerOptions.AllowAlternateSchemes to true.
  • If on a lower .NET version, consider something like GRPC :scheme pseudo-header passed from proxy/loadbalancer causes ConnectionAbortedException dotnet/aspnetcore#30532 (comment). Or consider upgrading to .NET 6.

What's happening:

For me setting KestrelServerOptions.AllowAlternate was enough to make my GRPC server work again.

As @Craig said, you can track the issue here and see if it gets resolved.

Vince.Bdn
  • 1,145
  • 1
  • 13
  • 28