
I have a Go gRPC client connected to a gRPC server running in a different pod in my k8s cluster.

It's working well, receiving and processing requests.

I am now wondering how best to implement resiliency in the event that the gRPC server pod gets recycled.

As far as I can ascertain, the clientconn.go code should handle the reconnection automatically, but I just cannot get it to work and I fear my implementation is incorrect in the first instance.

Calling code from main:

go func() {
    if err := gRPCClient.ProcessRequests(); err != nil {
        log.Error("Error while processing requests: ", err)
        //do something here??
    }
}()

My code in the gRPCClient wrapper module:

func (grpcclient *gRPCClient) ProcessRequests() error {
    defer grpcclient.Close()    

    for {
        request, err := reqclient.stream.Recv()
        log.Info("Request received")
        if err == io.EOF {          
            break
        }
        if err != nil {
            //when pod is recycled, this is what's hit with err:
            //"rpc error: code = Unavailable desc = transport is closing"

            //what is the correct pattern for recovery here so that we can await connection
            //and continue processing requests once more?
            //should I return err here and somehow restart the ProcessRequests() goroutine in the
            //main function?
            break
            
        } else {
            //the happy path
            //code block to process any requests that are received
        }
    }

    return nil
}

func (reqclient *RequestClient) Close() {
//this is called soon after the connection drops
        reqclient.conn.Close()
}

EDIT: Emin Laletovic answered my question elegantly below and gets it most of the way there. I had to make a few changes to the waitUntilReady function:

func (grpcclient *gRPCClient) waitUntilReady() bool {
    //define how long you want to wait for connection to be restored before giving up
    timeoutDuration := 300 * time.Second
    ctx, cancel := context.WithTimeout(context.Background(), timeoutDuration)
    defer cancel()

    currentState := grpcclient.conn.GetState()
    stillConnecting := true

    for currentState != connectivity.Ready && stillConnecting {
        //will return true when state has changed from currentState, false if timeout
        stillConnecting = grpcclient.conn.WaitForStateChange(ctx, currentState)
        currentState = grpcclient.conn.GetState()
        log.WithFields(log.Fields{"state": currentState, "timeout": timeoutDuration}).Info("Attempting reconnection. State has changed to:")
    }

    if !stillConnecting {
        log.Error("Connection attempt has timed out.")
        return false
    }

    return true
}
Fin

2 Answers

The RPC connection is being handled automatically by clientconn.go, but that doesn't mean the stream is also automatically handled.

The stream, once broken, whether by the RPC connection breaking down or some other reason, cannot reconnect automatically, and you need to get a new stream from the server once the RPC connection is back up.

The pseudo-code for waiting for the RPC connection to reach the READY state and then establishing a new stream might look something like this:

func (grpcclient *gRPCClient) ProcessRequests() error {
    defer grpcclient.Close()    
    
    go grpcclient.process()
    for {
      select {
        case <- grpcclient.reconnect:
           if !grpcclient.waitUntilReady() {
             return errors.New("failed to establish a connection within the defined timeout")
           }
           go grpcclient.process()
        case <- grpcclient.done:
          return nil
      }
    }
}

func (grpcclient *gRPCClient) process() {
    reqclient := GetStream() //always get a new stream
    for {
        request, err := reqclient.stream.Recv()
        log.Info("Request received")
        if err == io.EOF {          
            grpcclient.done <- true
            return
        }
        if err != nil {
            grpcclient.reconnect <- true
            return
            
        } else {
            //the happy path
            //code block to process any requests that are received
        }
    }
}

func (grpcclient *gRPCClient) waitUntilReady() bool {
  ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) //define how long you want to wait for connection to be restored before giving up
  defer cancel()
  return grpcclient.conn.WaitForStateChange(ctx, connectivity.Ready)
}

EDIT:

Revisiting the code above, a couple of mistakes should be corrected. The WaitForStateChange function waits for the connection state to change from the passed state; it does not wait for the connection to change into the passed state.

It is better to track the current state of the connection and use the Connect function to connect if the channel is idle.

func (grpcclient *gRPCClient) ProcessRequests() error {
        defer grpcclient.Close()    
        
        go grpcclient.process()
        for {
          select {
            case <- grpcclient.reconnect:
               if !grpcclient.isReconnected(1*time.Second, 60*time.Second) {
                 return errors.New("failed to establish a connection within the defined timeout")
               }
               go grpcclient.process()
            case <- grpcclient.done:
              return nil
          }
        }
}

func (grpcclient *gRPCClient) isReconnected(check, timeout time.Duration) bool {
  ctx, cancel := context.WithTimeout(context.Background(), timeout)
  defer cancel()
  ticker := time.NewTicker(check)
  defer ticker.Stop() //release the ticker when we return

  for{
    select {
      case <- ticker.C:
        grpcclient.conn.Connect()
 
        if grpcclient.conn.GetState() == connectivity.Ready {
          return true
        }
      case <- ctx.Done():
         return false
    }
  }
}
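
As an alternative to polling on a ticker, the two ideas above (track the current state, call Connect when the channel is idle) can be combined with WaitForStateChange so the loop only wakes up when the state actually changes. The following is a minimal sketch, not a drop-in implementation: State, stateConn, fakeConn and simulate are hypothetical stand-ins I made up so the example compiles with the standard library alone. In real code you would use connectivity.State and pass the *grpc.ClientConn itself, whose GetState, Connect and WaitForStateChange methods have the same shape.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// State is a local stand-in for connectivity.State (an assumption made so
// the sketch compiles without the grpc module).
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
)

// stateConn mirrors the three *grpc.ClientConn methods the loop relies on.
type stateConn interface {
	GetState() State
	Connect()
	WaitForStateChange(ctx context.Context, from State) bool
}

// waitForReady blocks until conn reports Ready or ctx expires.
func waitForReady(ctx context.Context, conn stateConn) bool {
	for {
		s := conn.GetState()
		if s == Ready {
			return true
		}
		if s == Idle {
			conn.Connect() // an idle channel does not dial on its own
		}
		// Returns false only if ctx is done before the state leaves s.
		if !conn.WaitForStateChange(ctx, s) {
			return false
		}
	}
}

// fakeConn is a hypothetical test double that walks through a scripted
// list of states; states[0] is always the current one.
type fakeConn struct{ states []State }

func (f *fakeConn) GetState() State { return f.states[0] }
func (f *fakeConn) Connect()        {}
func (f *fakeConn) WaitForStateChange(ctx context.Context, from State) bool {
	if len(f.states) == 1 {
		<-ctx.Done() // nothing left to change to: block until the deadline
		return false
	}
	select {
	case <-ctx.Done():
		return false
	case <-time.After(time.Millisecond):
		f.states = f.states[1:]
		return true
	}
}

// simulate runs waitForReady against a scripted fakeConn with a 200 ms budget.
func simulate(states ...State) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return waitForReady(ctx, &fakeConn{states: states})
}

func main() {
	fmt.Println("recovers:", simulate(TransientFailure, Idle, Connecting, Ready))
	fmt.Println("stuck:", simulate(TransientFailure))
}
```

The first call reaches Ready after three state changes; the second never leaves TransientFailure, so it gives up when the context deadline passes.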
Emin Laletovic
  • Thanks, this got me most of the way there. From my testing there was an issue in waitUntilReady(): WaitForStateChange waits for a change away from the supplied state, so it returns immediately because the state is already TransientFailure. To get this working, I introduced a for loop that keeps waiting until the state becomes Ready. See my original post for my version. – Fin Mar 02 '21 at 23:45
  • Hey can you please explain what is 'GetStream()'? What should it return? A new stream when grpc client calls streaming grpc server service? @Fin – Mojtaba Arezoomand May 29 '22 at 22:07

When a gRPC connection is closed, the state of the gRPC client connection will be IDLE or TRANSIENT_FAILURE. Here is my example of a custom reconnect mechanism for gRPC bi-directional streaming. First, a for loop keeps trying to reconnect until the gRPC server is back up; once it is, the state becomes READY after conn.Connect() is called.

for {
    select {
    case <-ctx.Done():
        return false
    default:
        if client.Conn.GetState() != connectivity.Ready {
            client.Conn.Connect()
        }

        // reserve a short duration (customizable) for conn to change state from idle to ready if grpc server is up
        time.Sleep(500 * time.Millisecond)

        if client.Conn.GetState() == connectivity.Ready {
            return true
        }

        // define reconnect time interval (backoff) or/and reconnect attempts here
        time.Sleep(2 * time.Second)
    }
}
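
The fixed 2-second sleep above is the simplest choice. If you want a real backoff, one common option is exponential growth with a cap plus some random jitter. The sketch below is only an illustration of that idea: backoffDelay and withJitter are names I made up, and the 500 ms base and 10 s cap are arbitrary assumptions, not values from the code above.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns base * 2^attempt, capped at max. attempt starts at 0.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// withJitter picks a random duration in [d/2, d] so that many clients
// restarting at the same moment do not all retry in lockstep.
func withJitter(d time.Duration) time.Duration {
	half := d / 2
	return half + time.Duration(rand.Int63n(int64(d-half)+1))
}

func main() {
	base, max := 500*time.Millisecond, 10*time.Second // assumed values
	for attempt := 0; attempt < 6; attempt++ {
		d := backoffDelay(attempt, base, max)
		fmt.Println("attempt", attempt, "delay", d, "with jitter", withJitter(d))
	}
}
```

In the loop above you would replace time.Sleep(2 * time.Second) with time.Sleep(withJitter(backoffDelay(attempt, base, max))) and increment attempt on each failed pass. With these values the un-jittered delays are 500ms, 1s, 2s, 4s, 8s, and then 10s for every later attempt.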

A separate goroutine handles the reconnect tasks. After a successful reconnect, it spawns another goroutine to listen to the gRPC server.

for {
    select {
    case <-ctx.Done():
        return
    case <-reconnectCh:
        if client.Conn.GetState() != connectivity.Ready && *isConnectedWebSocket {
            if o.waitUntilReady(client, isConnectedWebSocket, ctx) {
                err := o.generateNewProcessOrderStream(client, ctx)
                if err != nil {
                    logger.Logger.Error("failed to establish stream connection to grpc server ...")
                }

                // re-listening server side streaming
                go o.listenProcessOrderServerSide(client, reconnectCh, ctx, isConnectedWebSocket)
            }
        }
    }
}

Note that the listening task is handled concurrently by another goroutine.

// listening server side streaming
go o.listenProcessOrderServerSide(client, reconnectCh, websocketCtx, isConnectedWebSocket)
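
The body of the listening goroutine is not shown here, so the following is only a rough sketch of what such a listener might look like, not the actual implementation. The receiver interface, fakeStream, runListen and errUnavailable are hypothetical names used so the example runs with the standard library alone; with real gRPC the stream would be the generated stream client and Recv would return your message type.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// receiver is a hypothetical stand-in for a generated gRPC stream client;
// only the Recv method matters here.
type receiver interface {
	Recv() (string, error)
}

// listen reads until the stream ends. io.EOF means the server finished
// cleanly; any other error asks for a reconnect unless ctx is cancelled.
func listen(ctx context.Context, stream receiver, reconnectCh chan<- struct{}, handle func(string)) {
	for {
		msg, err := stream.Recv()
		if err == io.EOF {
			return
		}
		if err != nil {
			select {
			case reconnectCh <- struct{}{}:
			case <-ctx.Done(): // shutting down: do not block on the send
			}
			return
		}
		handle(msg)
	}
}

// errUnavailable imitates the error seen when the server pod goes away.
var errUnavailable = errors.New("transport is closing")

// fakeStream yields its scripted messages, then returns err forever.
type fakeStream struct {
	msgs []string
	err  error
}

func (f *fakeStream) Recv() (string, error) {
	if len(f.msgs) == 0 {
		return "", f.err
	}
	m := f.msgs[0]
	f.msgs = f.msgs[1:]
	return m, nil
}

// runListen drives listen with a fake stream and reports how many messages
// were handled and whether a reconnect was requested.
func runListen(broken bool, msgs ...string) (handled int, reconnect bool) {
	var end error = io.EOF
	if broken {
		end = errUnavailable
	}
	reconnectCh := make(chan struct{}, 1)
	listen(context.Background(), &fakeStream{msgs: msgs, err: end}, reconnectCh, func(string) { handled++ })
	select {
	case <-reconnectCh:
		reconnect = true
	default:
	}
	return handled, reconnect
}

func main() {
	n, again := runListen(true, "order-1", "order-2")
	fmt.Println("handled:", n, "reconnect requested:", again)
}
```

Checking the error before touching the message, and selecting on ctx.Done() alongside the channel send, keeps the goroutine from processing an empty message or leaking when nobody is left to receive the reconnect signal.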

You can check out my code example here. Hope this helps.

Credit: Emin Laletovic

yyh