I have an application (let's call it the client) connecting to another process (let's call it the server) on the same machine via gRPC. The communication goes over a Unix socket.

If the server is restarted, my client gets an EOF and does not re-establish the connection, although I expected the ClientConn to handle the reconnection automatically.

Why isn't the dialer taking care of the reconnection? I expected it to do so with the backoff params I passed.

Below is a pseudo-MWE.

  • Run establishes the initial connection, then starts goroutineOne
  • goroutineOne waits for the connection to be ready and delegates the send to fooUpdater
  • fooUpdater streams the data, or returns in case of errors
  • for waitUntilReady I used the pseudo-code referenced by this answer to get a new stream.
func main() {
    go func() {
        if err := Run(ctx); err != nil {
            log.Errorf("connection error: %v", err)
        }
        ctxCancel()
    }()
    // some wait logic
}


func Run(ctx context.Context) error {
    backoffConfig := backoff.Config{
        BaseDelay:  1 * time.Second,
        Multiplier: backoff.DefaultConfig.Multiplier,
        Jitter:     backoff.DefaultConfig.Jitter,
        MaxDelay:   120 * time.Second,
    }

    myConn, err := grpc.DialContext(ctx,
        "/var/run/foo.bar",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithConnectParams(grpc.ConnectParams{
            Backoff:           backoffConfig,
            MinConnectTimeout: 1 * time.Second,
        }),
        grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
            d := net.Dialer{}
            c, err := d.DialContext(ctx, "unix", addr)
            if err != nil {
                return nil, fmt.Errorf("connection to unix://%s failed: %w", addr, err)
            }
            return c, nil
        }),
    )
    if err != nil {
        return fmt.Errorf("could not establish socket for foo: %w", err)
    }
    defer myConn.Close()
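    return goroutineOne(ctx, myConn)
}

// goroutineOne restarts fooUpdater every time it returns, until ctx is done.
func goroutineOne(ctx context.Context, myConn *grpc.ClientConn) error {
    // Buffered so the updater goroutine can signal and exit even if nobody is receiving anymore.
    reconnect := make(chan struct{}, 1)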

    for {
        if ready := waitUntilReady(ctx, myConn, 2*time.Minute); !ready {
            return fmt.Errorf("myConn: %w, timeout: %s", ErrWaitReadyTimeout, "2m")
        }
        go func() {
            if err := fooUpdater(ctx, dataBuffer, myConn); err != nil {
                log.Errorf("foo updater: %v", err)
            }
            reconnect <- struct{}{}
        }()

        select {
        case <-ctx.Done():
            return nil
        case <-reconnect:
        }
    }
}

func fooUpdater(ctx context.Context, dataBuffer custom.CircularBuffer, myConn *grpc.ClientConn) error {
    clientStream, err := myConn.Stream(ctx) // custom pb code, returns grpc.ClientConn.NewStream(...)
    if err != nil {
        return fmt.Errorf("could not obtain stream: %w", err)
    }
    for {
        select {
        case <-ctx.Done():
            return nil
        case data := <-dataBuffer:
            if err := clientStream.Send(data); err != nil {
                return fmt.Errorf("could not send data: %w", err)
            }
        }
    }
}

func waitUntilReady(ctx context.Context, conn *grpc.ClientConn, maxTimeout time.Duration) bool {
    ctx, cancel := context.WithTimeout(ctx, maxTimeout)
    defer cancel()

    currentState := conn.GetState()
    timeoutValid := true

    for currentState != connectivity.Ready && timeoutValid {
        timeoutValid = conn.WaitForStateChange(ctx, currentState)
        currentState = conn.GetState()
        // debug print currentState -> prints IDLE
    }

    return currentState == connectivity.Ready
}

Debugging hints also welcome :)

gtatr
  • Did you have a chance to debug it? What is the last thing that gets executed after the client gets `EOF`? – Emin Laletovic Dec 28 '22 at 19:28
  • `waitUntilReady` is executed last, then it returns and I get the `connection error` print from my `main` after the `maxTimeout` – gtatr Jan 03 '23 at 07:51

2 Answers

0

Based on the provided code and information, there might be an issue with how ctx.Done() is being used.

ctx.Done() is used in both the fooUpdater and goroutineOne functions. When the connection breaks, I believe the ctx.Done() case fires in both functions, in the following order:

The connection breaks and the ctx.Done() case in fooUpdater fires, so the function returns. The select statement in goroutineOne then also takes the ctx.Done() case, which exits the function, and the client doesn't reconnect.

Try debugging it to check whether both select cases get executed; I believe that is the issue here.

Emin Laletovic
0

According to the gRPC documentation, the connection is re-established if the failure is transient; otherwise the call fails immediately. You can verify whether the failure is transient by printing the connectivity state.

You should also print the error code to understand why the RPC failed.

Maybe the failure you are seeing is not considered transient.

Also, according to the following entry, retry logic does not work with streams: grpc-java: Proper handling of retry on client for service streaming call

Here are the links to the corresponding docs: https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html and https://pkg.go.dev/google.golang.org/grpc#section-readme

Also, check the following entry: Ways to wait if server is not available in gRPC from client side

Kadir Korkmaz
  • "according to the following entry retry logic does not work with streams: grpc-java: Proper handling of retry on client for service streaming call" So what you are saying, according to the other answer, is that once a client is receiving from a specific server, it is _bound_ to that server and won't try connecting to a different one? Even if it is on the same Unix socket address? – gtatr Jan 03 '23 at 07:42
  • I am not sure, but I guess the semantics of the retry logic are not well defined in the case of streams. [Look at this C# doc](https://learn.microsoft.com/en-us/aspnet/core/grpc/retries?view=aspnetcore-7.0#when-retries-are-valid): it clearly states when a retry is valid. The Go docs are not clear about it. Also, I could not find a definition of transient failure in the docs. – Kadir Korkmaz Jan 03 '23 at 15:59