
I'm creating a Mattermost bot. It stops responding after the websocket connection receives a ping timeout (PingTimeoutChannel), which happens after random periods of time (1 minute, 8 minutes, 2 hours, etc.). The Mattermost server is v5.13, using API v4.

The bot connects to the Mattermost API by creating a new Client4. It then logs in as the user and creates a websocket client with the authorization token it received. It listens on all channels, and when it receives an event that is a message directed at it (@botname), it responds automatically by creating a model.Post.

I chose simple username/password authentication for logging in, just as in the Mattermost sample bot. I also tried switching to personal access token authentication (as in here), because I thought it would solve the timeout problem. That approach no longer works: it gives "Invalid or expired session error, please login again" when I try to log in that way.

So I dropped this idea and started searching for where the timeout happens. The server pings are fine; the websocket pings are not. I tried many approaches and ended up simply reconnecting (creating the Mattermost API and websocket clients again) when the timeout occurs. The bot still does not respond. I've run out of ideas.

Websocket connection (skipped error handling):

    if config.BotCfg.Port == "443" {
        protocol = "https"
        secure = true
    }

    config.ConnectionCfg.Client = model.NewAPIv4Client(fmt.Sprintf("%s://%s:%s", protocol, config.BotCfg.Server, config.BotCfg.Port))

    user,resp := config.ConnectionCfg.Client.Login(config.BotCfg.BotName, config.BotCfg.Password)

    setBotTeam()

    if limit.Users == nil {
        limit.SetUsersList()
    }

    ws := "ws"
    if secure {
        ws = "wss"
    }

    if Websocket != nil {
        Websocket.Close()
    }

    websocket, err := model.NewWebSocketClient4(fmt.Sprintf("%s://%s:%s", ws, config.BotCfg.Server, config.BotCfg.Port), config.ConnectionCfg.Client.AuthToken)

Listening function:

    go func() {
        for {
            select {
            // A ping went unanswered: rebuild the API and websocket clients.
            case <-connection.Websocket.PingTimeoutChannel:
                logs.WriteToFile("Websocket ping timeout. Connecting again.")
                log.Println("Websocket ping timeout. Connecting again.")
                mux.Lock()
                connection.Connect()
                mux.Unlock()

            // An event arrived: respond if it is a valid message.
            case event := <-connection.Websocket.EventChannel:
                mux.Lock()
                if event != nil {
                    if event.IsValid() && isMessage(event.Event) {
                        handleEvent(event)
                    }
                }
                mux.Unlock()
            }
        }
    }()
    // Block forever so the goroutine above keeps running.
    select {}
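For comparison, here is a minimal sketch of the same listen-and-reconnect pattern using only the standard library. Everything in it is an assumption made for illustration: `conn`, `dialer`, `listen` and `demo` are hypothetical stand-ins, not part of the Mattermost `model` package. It shows two things that may be relevant to the loop above: a receive from a closed channel succeeds immediately with a zero value, so the `ok` flag has to be checked, and the reconnect has to replace the value whose channels the `select` reads.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// conn is a hypothetical stand-in for a websocket client; it is not the
// Mattermost model.WebSocketClient type.
type conn struct {
	events      chan string   // closed by the reader when the connection dies
	pingTimeout chan struct{} // signalled when a ping goes unanswered
}

// dialer opens a fresh connection. In a real bot this is where the API and
// websocket clients would be recreated and listening restarted.
type dialer func() (*conn, error)

// listen handles events until reconnecting fails maxRetries times in a row.
// It returns the number of events handled.
func listen(dial dialer, handle func(string), maxRetries int, backoff time.Duration) (int, error) {
	c, err := dial()
	if err != nil {
		return 0, err
	}
	handled := 0
	for {
		reconnect := false
		select {
		case <-c.pingTimeout:
			reconnect = true
		case ev, ok := <-c.events:
			if !ok {
				// A closed channel is always ready to receive; without
				// this check the loop would spin on zero values.
				reconnect = true
			} else {
				handle(ev)
				handled++
			}
		}
		if !reconnect {
			continue
		}
		var next *conn
		for attempt := 1; attempt <= maxRetries; attempt++ {
			if next, err = dial(); err == nil {
				break
			}
			time.Sleep(backoff * time.Duration(attempt)) // linear backoff
		}
		if next == nil {
			return handled, fmt.Errorf("reconnect failed after %d attempts: %v", maxRetries, err)
		}
		// Replace c so the next select reads the new connection's channels.
		c = next
	}
}

// demo simulates a dropped connection, a ping timeout, and a dead server.
func demo() (int, error) {
	calls := 0
	dial := func() (*conn, error) {
		calls++
		c := &conn{events: make(chan string, 1), pingTimeout: make(chan struct{}, 1)}
		switch calls {
		case 1: // one event, then the connection drops
			c.events <- "a"
			close(c.events)
		case 2: // no events, only a ping timeout
			c.pingTimeout <- struct{}{}
		case 3: // one more event, then the connection drops again
			c.events <- "b"
			close(c.events)
		default:
			return nil, errors.New("server unreachable")
		}
		return c, nil
	}
	return listen(dial, func(string) {}, 2, time.Millisecond)
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```

In this sketch the loop only ever reads from `c`, so a reconnect that stored the new client in a different variable would leave the `select` waiting on the old, dead channels.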

I expect the bot to run continuously. If you have any suggestions on how to fix this issue, I'd really appreciate it!

Edit: As Cerise suggested, I added SIGQUIT handling to the exit function and ran the race detector. I fixed the data race by deleting one if statement from the case event := [...] branch. The race detector no longer reports any issues, but the bot still stops responding after some time.

I found out that the first time PingTimeout occurs, the peer stops responding until I restart the app. Reconnecting the websocket doesn't help. I don't know how to solve this problem, or whether a solution even exists.

gabkub
  • Run the program with the [race detector](https://golang.org/doc/articles/race_detector.html) and fix any issues. Describe in more detail what the program is doing when it has stopped responding. Is it blocked on the select in the question or somewhere else? If you don't know what the program is doing, then dump the goroutine stacks by sending the program a SIGQUIT. – Charlie Tumahai Jul 23 '19 at 15:28
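As a complement to SIGQUIT, which terminates the process, the goroutine stacks can also be dumped from inside a running program. The following is a minimal sketch assuming a Unix-like system (it uses `syscall.SIGUSR1`, which does not exist on Windows); the helper names `dumpGoroutines` and `dumpOnSignal` are made up for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpGoroutines returns the stack traces of all current goroutines.
func dumpGoroutines() string {
	var buf bytes.Buffer
	// debug=2 prints stacks in the same form as an unrecovered panic.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "goroutine dump failed: " + err.Error()
	}
	return buf.String()
}

// dumpOnSignal prints the goroutine stacks to stderr each time the process
// receives SIGUSR1, without stopping the program.
func dumpOnSignal() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			fmt.Fprintln(os.Stderr, dumpGoroutines())
		}
	}()
}

func main() {
	dumpOnSignal()
	// Print one dump immediately to show the output format.
	fmt.Print(dumpGoroutines())
}
```

With `dumpOnSignal()` called at startup, running `kill -USR1 <pid>` against the hung bot would show whether the listening goroutine is blocked in the `select`, waiting on the mutex, or somewhere else.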
