We have a Concourse Web container and a Concourse Worker container running on Server A (212.77.7.255; the real IP is concealed). We use the latest Concourse version, 7.8.1.
As we ran out of Worker resources, we added another Concourse Worker container running on Server B. The Worker on Server B ran fine for about five days, but all of a sudden it can no longer connect to Concourse Web on Server A.
The logs of the Worker on Server B say:
{
"timestamp": "2022-07-12T11:15:59.542985762Z",
"level": "error",
"source": "worker",
"message": "worker.container-sweeper.tick.failed-to-connect-to-tsa",
"data": {
"error": "dial tcp 212.77.7.255:2222: i/o timeout",
"session": "6.4"
}
}{
"timestamp": "2022-07-12T11:15:59.543044656Z",
"level": "error",
"source": "worker",
"message": "worker.container-sweeper.tick.dial.failed-to-connect-to-any-tsa",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "6.4.2"
}
}{
"timestamp": "2022-07-12T11:15:59.543060804Z",
"level": "error",
"source": "worker",
"message": "worker.container-sweeper.tick.failed-to-dial",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "6.4"
}
}{
"timestamp": "2022-07-12T11:15:59.543068953Z",
"level": "error",
"source": "worker",
"message": "worker.container-sweeper.tick.failed-to-get-containers-to-destroy",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "6.4"
}
}{
"timestamp": "2022-07-12T11:15:59.554118751Z",
"level": "error",
"source": "worker",
"message": "worker.volume-sweeper.tick.failed-to-connect-to-tsa",
"data": {
"error": "dial tcp 212.77.7.255:2222: i/o timeout",
"session": "7.4"
}
}{
"timestamp": "2022-07-12T11:15:59.554164844Z",
"level": "error",
"source": "worker",
"message": "worker.volume-sweeper.tick.dial.failed-to-connect-to-any-tsa",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "7.4.3"
}
}{
"timestamp": "2022-07-12T11:15:59.554172593Z",
"level": "error",
"source": "worker",
"message": "worker.volume-sweeper.tick.failed-to-dial",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "7.4"
}
}{
"timestamp": "2022-07-12T11:15:59.554179789Z",
"level": "error",
"source": "worker",
"message": "worker.volume-sweeper.tick.failed-to-get-volumes-to-destroy",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "7.4"
}
}{
"timestamp": "2022-07-12T11:16:04.580220012Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.beacon.failed-to-connect-to-tsa",
"data": {
"error": "dial tcp 212.77.7.255:2222: i/o timeout",
"session": "4.1"
}
}{
"timestamp": "2022-07-12T11:16:04.580284659Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.beacon.dial.failed-to-connect-to-any-tsa",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "4.1.10"
}
}{
"timestamp": "2022-07-12T11:16:04.580335377Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.beacon.failed-to-dial",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "4.1"
}
}{
"timestamp": "2022-07-12T11:16:04.580359868Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.beacon.exited-with-error",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "4.1"
}
}{
"timestamp": "2022-07-12T11:16:04.580372552Z",
"level": "debug",
"source": "worker",
"message": "worker.beacon-runner.beacon.done",
"data": {
"session": "4.1"
}
}{
"timestamp": "2022-07-12T11:16:04.580394879Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.failed",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "4"
}
}
The logs of Concourse Web on Server A show no entries for the Worker on Server B trying to connect. From Server B, I am able to connect to Concourse Web on Server A:
$ nc 212.77.7.255 2222
SSH-2.0-Go
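To repeat this check from the worker container's own network namespace rather than from the host, a small probe like the following could be used. This is only a minimal sketch: `probe_tsa` is a hypothetical helper name, and it assumes that `bash` and the coreutils `timeout` command are available wherever it runs (which may not hold inside every worker image). It opens a TCP connection via bash's `/dev/tcp` redirection and prints the first line the remote side sends, which for an SSH endpoint should be its banner.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Concourse): probe a TCP endpoint and
# print the first line it sends, e.g. the SSH banner of the TSA port.
# Arguments: host, port, optional timeout in seconds (default 5).
probe_tsa() {
  local host="$1" port="$2" wait_s="${3:-5}"
  # timeout(1) bounds both the connect attempt and the banner read;
  # it exits with status 124 if the time limit is hit.
  timeout "$wait_s" bash -c '
    # Inside this child shell, $0 is the host and $1 is the port.
    exec 3<>"/dev/tcp/$0/$1" || exit 1   # connect failed or refused
    IFS= read -r banner <&3 || exit 2    # connected, but nothing was sent
    printf "%s\n" "$banner"
  ' "$host" "$port"
}

# Example (commented out so that sourcing this file has no side effects):
# probe_tsa 212.77.7.255 2222
```

If the function is pasted into a shell on the Server B host and again into a shell inside the worker container, comparing the two results would show whether the timeout only occurs from within the container.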
We had this problem before, but we solved it by upgrading Concourse to the latest version, 7.8.1. Now I'm running out of ideas for how to debug this. What I've tried:
- restarting the workers
- restarting the web container
- pruning the stalled worker of Server B
- running docker system prune on Server B
Nothing helps. What can I do to debug this further and get the Worker on Server B to connect again?