Concourse Worker on another server loses connection to Concourse Web

Question

We have a Concourse Web Container and a Concourse Worker Container running on Server A (212.77.7.255; the real IP is concealed). We use the latest Concourse version, 7.8.1.

As we ran out of Worker resources, we added another Concourse Worker Container running on Server B. The Worker on Server B has been running fine for about five days, but all of a sudden it is not able to connect anymore to Concourse Web on Server A.

The logs of the Worker on Server B say:

{
    "timestamp": "2022-07-12T11:15:59.542985762Z",
    "level": "error",
    "source": "worker",
    "message": "worker.container-sweeper.tick.failed-to-connect-to-tsa",
    "data": {
        "error": "dial tcp 212.77.7.255:2222: i/o timeout",
        "session": "6.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.543044656Z",
    "level": "error",
    "source": "worker",
    "message": "worker.container-sweeper.tick.dial.failed-to-connect-to-any-tsa",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "6.4.2"
    }
}{
    "timestamp": "2022-07-12T11:15:59.543060804Z",
    "level": "error",
    "source": "worker",
    "message": "worker.container-sweeper.tick.failed-to-dial",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "6.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.543068953Z",
    "level": "error",
    "source": "worker",
    "message": "worker.container-sweeper.tick.failed-to-get-containers-to-destroy",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "6.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.554118751Z",
    "level": "error",
    "source": "worker",
    "message": "worker.volume-sweeper.tick.failed-to-connect-to-tsa",
    "data": {
        "error": "dial tcp 212.77.7.255:2222: i/o timeout",
        "session": "7.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.554164844Z",
    "level": "error",
    "source": "worker",
    "message": "worker.volume-sweeper.tick.dial.failed-to-connect-to-any-tsa",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "7.4.3"
    }
}{
    "timestamp": "2022-07-12T11:15:59.554172593Z",
    "level": "error",
    "source": "worker",
    "message": "worker.volume-sweeper.tick.failed-to-dial",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "7.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.554179789Z",
    "level": "error",
    "source": "worker",
    "message": "worker.volume-sweeper.tick.failed-to-get-volumes-to-destroy",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "7.4"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580220012Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.failed-to-connect-to-tsa",
    "data": {
        "error": "dial tcp 212.77.7.255:2222: i/o timeout",
        "session": "4.1"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580284659Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.dial.failed-to-connect-to-any-tsa",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "4.1.10"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580335377Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.failed-to-dial",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "4.1"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580359868Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.exited-with-error",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "4.1"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580372552Z",
    "level": "debug",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.done",
    "data": {
        "session": "4.1"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580394879Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.failed",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "4"
    }
}

The logs on Concourse Web on Server A show no entries of the Worker on Server B trying to connect. On Server B I'm able to connect to Concourse Web on Server A:

$ nc 212.77.7.255 2222
SSH-2.0-Go

We had this problem before, but we solved it by upgrading Concourse to the latest version 7.8.1. Now I'm running out of options where to debug this. What I've tried:

restarting the workers
restarting the web container
pruning the stalled worker of Server B
docker system prune on Server B

Nothing helps. What can I do to debug this further and make the Worker on Server B connect again?

oozie _at_ concourse.farm · Answer 1 · 2022-07-13T20:28:53.013

1

You said it happened to an earlier version, you "ran out of Worker resources", and I'm seeing I/O timeout in the logs... the one component you didn't mention is the DB.

It might be that the max conns on the DB has been reached, especially if the DB is used for purposes other than just Concourse. That's where I'd look next.

edited Jul 13 '22 at 20:28

answered Jul 12 '22 at 21:27

oozie _at_ concourse.farm


Then there should have been some sort of error in the logs like "Can not connect worker, as max db connections have been reached". I didn't find anything like that. – mles Jul 14 '22 at 15:35

score 0 · Accepted Answer · answered Jul 14 '22 at 15:39

We couldn't find out why the docker network did not allow connecting to Server A. As connections on the host machine were going through, we told docker to use the host network:

services:
  concourse-worker:
    ...
    network_mode: host
    ...

This solved the issue. It is not a pretty workaround, as the Docker container should have its own separate network, but as nothing else is running on this server, it's fine.
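For anyone hitting the same symptom, here is a minimal sketch of a check that separates "the host can reach the TSA port" from "the container can reach the TSA port". It is not part of Concourse; the function name `probe_tsa` and its return strings are made up for illustration. It dials the address, then reads the first bytes the server sends, much like the `nc` test in the question.

```python
import socket


def probe_tsa(host: str, port: int = 2222, timeout: float = 5.0) -> str:
    """Dial host:port and report whether a banner came back.

    Hypothetical helper for illustration only, not a Concourse API.
    """
    try:
        # Same kind of TCP dial the worker performs against the TSA port.
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        # Corresponds to the "i/o timeout" seen in the worker logs.
        return "dial timed out"
    except OSError as exc:
        # Refused, unreachable, DNS failure, and similar errors.
        return f"dial failed: {exc}"

    with conn:
        try:
            # An SSH server sends its identification string first.
            banner = conn.recv(64)
        except OSError:
            return "connected, but no banner received"

    return "connected, banner: " + banner.decode("ascii", errors="replace").strip()
```

Calling `print(probe_tsa("212.77.7.255"))` once on the host and once inside the worker container (assuming a Python interpreter is available there, which may not be the case for the worker image) shows whether only the container's network path is affected. If the host connects and the container times out, the problem is most likely in the Docker network rather than in Concourse.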
