Vanishing network connectivity in HPC cluster

Question

Hello and thanks to everyone in advance! My team and I have been plagued by the instability of a cluster that we use for scientific research. We have a lot of science and software engineering experience, but not much experience running a cluster. I will try to be as brief as possible.

We run an HPC cluster of about 10 machines, each with anywhere from 4 to 8 NVIDIA GeForce GTX 1080 GPUs, that we use for scientific computing. The machines themselves are Supermicro GPU SuperServers (we have a few different models). Each of these motherboards has two general-purpose NICs, only one of which is connected to our network. In addition, each machine has an independent management (IPMI) NIC, which is also connected to the same network. Note: all NICs are on the same subnet. The network is run by a Meraki MX84 router, and a 24-port Netgear switch sits between the router and the machines.

There are two other special machines: one runs MAAS, which is what we use to manage the cluster. The other has a RAID controller and a few terabytes of RAID5 array. All machines are connected to this machine via NFS.

All machines are running Ubuntu Server 16.04.

The machines are located in a colocation center about an hour away from our office. We have two ways of connecting to these machines: 1) a VPN into the network provided by Meraki and 2) ssh via a reverse tunnel to another machine we have running in the cloud.

Under normal circumstances we have CPU and GPU intensive jobs running on the GPU machines which load necessary data from the NFS mounted RAID array.

The problem: the system is not stable! We cannot get more than a few days of runtime out of these machines before everything goes to hell. Here are the symptoms of hell:

Most of the machines cannot be connected to (via either SSH or VPN).
The machines that are inaccessible are also inaccessible via IPMI.
Some of the machines CAN be connected to but present a very slow shell (by which I mean you can type commands but there is a noticeable lag between keystroke and response; it feels very much like a network problem).
Those machines that we can get into seem to have broken outbound Internet connectivity. Specifically, ping google.com fails with a DNS resolution error: unknown host google.com.
A soft reboot is not enough; to restore function, we have to power cycle the machines via remote PDU.
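
To tell whether the reachable machines have lost only DNS or all outbound routing, the checks can be split into stages. The following is a minimal Python sketch, not something we currently run. It assumes Linux `ping` from iputils, uses the DHCP server address from the log (192.168.128.101) as the local target, and uses 8.8.8.8 (Google public DNS) as an assumed outside address. All function names are hypothetical.

```python
import socket
import subprocess


def can_ping(ip, timeout_s=2):
    """Return True if a single ICMP echo to `ip` is answered (Linux iputils ping)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ping binary missing or not executable
        return False
    return result.returncode == 0


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address."""
    try:
        return bool(resolver(hostname, None))
    except OSError:
        # socket.gaierror is a subclass of OSError
        return False


def diagnose(local_ok, internet_ip_ok, dns_ok):
    """Map the three check results to the most likely failing layer."""
    if not local_ok:
        return "local network unreachable: suspect NIC, cable, or switch"
    if not internet_ip_ok:
        return "LAN works but nothing beyond it: suspect router or uplink"
    if not dns_ok:
        return "IP routing works but names do not resolve: suspect DNS"
    return "local network, routing, and DNS all look healthy"


def run_checks(local_ip="192.168.128.101",  # DHCP server seen in the log
               internet_ip="8.8.8.8",       # assumed outside address
               hostname="google.com"):
    """Run all three checks against the (assumed) targets and summarize."""
    return diagnose(can_ping(local_ip), can_ping(internet_ip), can_resolve(hostname))
```

Calling `run_checks()` on a node during an outage would show whether the "unknown host" error is a DNS problem in isolation or a symptom of the node having lost the LAN altogether.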

Our investigations have revealed that the machines we can't get into at all are actually still alive; it is a network problem that blocks our access to them. At the bottom of the post is a log that I pulled from one of the "dead" machines after a reboot. What you see is normal DHCP renewal activity until around 3:10 AM, at which point DHCPREQUEST renewals stop being acknowledged. The address is withdrawn at about 3:15 AM, and the DHCPDISCOVER broadcasts that follow receive no offers either. Of course, at this point the ssh tunnels (which are run with autossh) start failing.

My original theory on this was that the culprit was MAAS, since we were using its DHCP server rather than the one provided by the Meraki router. To test this theory, I rebuilt the cluster with a new installation of MAAS, this time using Meraki's DHCP service rather than that of MAAS. After two days, the system failed in the standard way, so I think I've ruled out MAAS (at least as far as DHCP is concerned).

Some on our team have an intuition that NFS is to blame. The theory is roughly that NFS fails and then everything else freaks out. We know that when NFS dies, client filesystems have a hard time recovering, but it is unclear how this would affect the network.
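
One way to test the NFS theory would be to record when the mount stops responding and compare that with when DHCP renewals stop. Below is a rough sketch under stated assumptions: the mount point `/mnt/raid` is a placeholder (our real path is not given here), and the function names are hypothetical. The check runs in a child process because a `stat` on a hung NFS mount can block indefinitely.

```python
import subprocess
import sys
import time


def mount_responds(path, timeout_s=5.0):
    """Return True if os.stat(path) completes in a child process within timeout_s."""
    proc = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait for the child, it may be stuck in the kernel.
        proc.kill()
        return False


def probe_line(path, timeout_s=5.0):
    """Return one timestamped status line suitable for appending to a local log."""
    status = "ok" if mount_responds(path, timeout_s) else "UNRESPONSIVE"
    return "%s nfs-probe %s %s" % (time.strftime("%b %d %H:%M:%S"), path, status)


# Hypothetical usage, e.g. once a minute from cron on each node:
#   print(probe_line("/mnt/raid"))
```

If the probe starts reporting UNRESPONSIVE before the DHCP failures appear in syslog, that would support the NFS theory; if it only fails afterwards, NFS is more likely a victim than the cause.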

Any help on this issue would be great. As I said, none of us have a lot of experience running a cluster, so pointers on where to start looking would be good. Even better would be some ideas on specifically what the issue is and how to fix it.

Thanks in advance!

Log example:

Apr 11 02:02:31 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 252 seconds.
Apr 11 02:06:43 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:06:44 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:06:44 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 291 seconds.
Apr 11 02:11:35 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:11:35 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:11:35 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 239 seconds.
Apr 11 02:15:35 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:15:35 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:15:35 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 275 seconds.
Apr 11 02:17:01 cluster9 CRON[7877]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 11 02:20:11 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:20:11 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:20:11 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 250 seconds.
Apr 11 02:24:21 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:24:22 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:24:22 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 279 seconds.
Apr 11 02:29:01 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:29:01 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:29:01 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 288 seconds.
Apr 11 02:33:49 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:33:49 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:33:49 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 281 seconds.
Apr 11 02:38:30 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:38:30 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:38:30 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 296 seconds.
Apr 11 02:43:26 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:43:26 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:43:26 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 270 seconds.
Apr 11 02:47:56 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:47:56 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:47:56 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 260 seconds.
Apr 11 02:52:16 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:52:16 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:52:16 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 276 seconds.
Apr 11 02:56:52 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:56:52 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:56:52 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 254 seconds.
Apr 11 03:01:06 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 03:01:06 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 03:01:06 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 241 seconds.
Apr 11 03:01:30 cluster9 systemd[1]: Started Session 488 of user ubuntu.
Apr 11 03:05:07 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 03:05:07 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 03:05:07 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 290 seconds.
Apr 11 03:09:57 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 03:13:51 cluster9 dhclient[1558]: message repeated 18 times: [ DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)]
Apr 11 03:14:04 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 255.255.255.255 port 67 (xid=0x5f5d62a8)
Apr 11 03:15:05 cluster9 dhclient[1558]: message repeated 5 times: [ DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 255.255.255.255 port 67 (xid=0x5f5d62a8)]
Apr 11 03:15:08 cluster9 avahi-daemon[1465]: Withdrawing address record for 192.168.128.120 on enp3s0f0.
Apr 11 03:15:08 cluster9 avahi-daemon[1465]: Leaving mDNS multicast group on interface enp3s0f0.IPv4 with address 192.168.128.120.
Apr 11 03:15:08 cluster9 avahi-daemon[1465]: Interface enp3s0f0.IPv4 no longer relevant for mDNS.
Apr 11 03:15:08 cluster9 systemd[1]: Stopping Network Time Synchronization...
Apr 11 03:15:08 cluster9 systemd[1]: Stopped Network Time Synchronization.
Apr 11 03:15:08 cluster9 systemd[1]: Starting Network Time Synchronization...
Apr 11 03:15:08 cluster9 systemd[1]: Started Network Time Synchronization.
Apr 11 03:15:08 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 3 (xid=0x3bb49111)
Apr 11 03:15:11 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 5 (xid=0x3bb49111)
Apr 11 03:15:16 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 11 (xid=0x3bb49111)
Apr 11 03:15:27 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 16 (xid=0x3bb49111)
Apr 11 03:15:43 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x3bb49111)
Apr 11 03:16:03 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 16 (xid=0x3bb49111)
Apr 11 03:16:19 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x3bb49111)
Apr 11 03:16:36 cluster9 autossh[13532]: timeout polling to accept read connection
Apr 11 03:16:36 cluster9 autossh[13532]: port down, restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 476)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8161
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 477)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8162
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 478)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8163
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 479)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8164
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 480)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8165
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 481)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8166
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:38 cluster9 autossh[13532]: starting ssh (count 482)
Apr 11 03:16:38 cluster9 autossh[13532]: ssh child pid is 8167
Apr 11 03:16:38 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:39 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 19 (xid=0x3bb49111)
Apr 11 03:16:46 cluster9 autossh[13532]: starting ssh (count 483)
Apr 11 03:16:46 cluster9 autossh[13532]: ssh child pid is 8168
Apr 11 03:16:46 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:58 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 19 (xid=0x3bb49111)
Apr 11 03:17:01 cluster9 CRON[8170]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 11 03:17:04 cluster9 autossh[13532]: starting ssh (count 484)
Apr 11 03:17:04 cluster9 autossh[13532]: ssh child pid is 8172
Apr 11 03:17:04 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:17:17 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 11 (xid=0x3bb49111)
Apr 11 03:17:28 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 11 (xid=0x3bb49111)
Apr 11 03:17:36 cluster9 autossh[13532]: starting ssh (count 485)
Apr 11 03:17:36 cluster9 autossh[13532]: ssh child pid is 8173
Apr 11 03:17:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:17:39 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 12 (xid=0x3bb49111)
Apr 11 03:17:51 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 17 (xid=0x3bb49111)
Apr 11 03:18:08 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 11 (xid=0x3bb49111)
Apr 11 03:18:19 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 18 (xid=0x3bb49111)
Apr 11 03:18:26 cluster9 autossh[13532]: starting ssh (count 486)
Apr 11 03:18:26 cluster9 autossh[13532]: ssh child pid is 8174
Apr 11 03:18:26 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:18:37 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 10 (xid=0x3bb49111)
Apr 11 03:18:47 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 14 (xid=0x3bb49111)
Apr 11 03:19:01 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 21 (xid=0x3bb49111)
Apr 11 03:19:22 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 9 (xid=0x3bb49111)
Apr 11 03:19:31 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 13 (xid=0x3bb49111)
Apr 11 03:19:38 cluster9 autossh[13532]: starting ssh (count 487)
Apr 11 03:19:38 cluster9 autossh[13532]: ssh child pid is 8175
Apr 11 03:19:38 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:19:44 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 7 (xid=0x3bb49111)
Apr 11 03:19:51 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 8 (xid=0x3bb49111)
Apr 11 03:19:59 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 10 (xid=0x3bb49111)
Apr 11 03:20:09 cluster9 dhclient[1558]: No DHCPOFFERS received.
Apr 11 03:20:09 cluster9 dhclient[1558]: No working leases in persistent database - sleeping.
Apr 11 03:21:16 cluster9 autossh[13532]: starting ssh (count 488)
Apr 11 03:21:16 cluster9 autossh[13532]: ssh child pid is 8182
Apr 11 03:21:16 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:23:24 cluster9 autossh[13532]: starting ssh (count 489)
Apr 11 03:23:24 cluster9 autossh[13532]: ssh child pid is 8183
Apr 11 03:23:24 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:26:06 cluster9 autossh[13532]: starting ssh (count 490)
Apr 11 03:26:06 cluster9 autossh[13532]: ssh child pid is 8185
Apr 11 03:26:06 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:26:36 cluster9 autossh[13532]: starting ssh (count 491)
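
To compare failure times across machines, the key moments can be pulled out of a log like the one above automatically. This is a minimal sketch that assumes the syslog line format shown here; `dhcp_timeline` is a hypothetical helper, and the sample lines are copied from the excerpt.

```python
import re

# Matches e.g. "Apr 11 03:05:07 cluster9 dhclient[1558]: DHCPACK of ..."
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ dhclient\[\d+\]: (.*)$")


def dhcp_timeline(lines):
    """Return (last_ack, first_unanswered, gave_up) timestamps as strings or None."""
    last_ack = first_unanswered = gave_up = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a dhclient line
        stamp, msg = match.groups()
        if msg.startswith("DHCPACK"):
            last_ack = stamp
            first_unanswered = None  # the pending request was answered
        elif msg.startswith(("DHCPREQUEST", "DHCPDISCOVER")) and first_unanswered is None:
            first_unanswered = stamp
        elif msg.startswith("No DHCPOFFERS"):
            gave_up = stamp
    return last_ack, first_unanswered, gave_up


SAMPLE = [
    "Apr 11 03:05:07 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)",
    "Apr 11 03:05:07 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101",
    "Apr 11 03:09:57 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)",
    "Apr 11 03:15:08 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 3 (xid=0x3bb49111)",
    "Apr 11 03:20:09 cluster9 dhclient[1558]: No DHCPOFFERS received.",
]

print(dhcp_timeline(SAMPLE))
# -> ('Apr 11 03:05:07', 'Apr 11 03:09:57', 'Apr 11 03:20:09')
```

Running this over the syslog of every node would show whether all machines lose the DHCP server at the same moment, which would point at the shared switch or router instead of at any single host.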

It sounds like the Meraki (guess this is your switch too?) can't keep up. Can you see what the load on it is like? — Eddie Dunn, Apr 21 '17 at 17:27
Sorry; I forgot to mention that there is a Netgear switch between the Meraki router and the computers. — Joshua Gevirtz, Apr 21 '17 at 17:32
That is likely your issue. You did not say what kind of code you are running, but I have had really bad luck trying to do MPI or similar types of operations. I actually saw almost exactly the same behavior with a GS748T. I would look at getting a white label switch (10 Gigabit if you can swing it) that is up to the task. Or, even better, have a separate InfiniBand network. — Eddie Dunn, Apr 21 '17 at 17:58
Interesting. We have an infiniband network, actually; we use it for GPU <-> GPU communication via Mellanox GPU RDMA. Sounds like it would be worth trying to spread data via IB just to see if it impacts stability. — Joshua Gevirtz, Apr 21 '17 at 18:01
