I'm running Ubuntu 18.10 on a VPS.

Ever since upgrading (I'm pretty sure) from 16.04, my secondary IP address just stops receiving traffic after it's been up for a few hours.

I'll have two pings running, to my primary IP and my secondary IP, and the secondary one will just spontaneously go down after about 3 to 4 hours.

When this happens, its interface, eth0, will still show as <UP> in ifconfig -a. An mtr will make it all the way to its gateway.

Only a reboot brings the IP back up and makes it reachable again. Nothing else helps: not ifdown eth0 --force && ifup eth0, and not service networking restart.

Relevant interfaces:

$ ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 149.210.175.202  netmask 255.255.255.0  broadcast 149.210.175.255
        ether 52:54:00:35:97:95  txqueuelen 1000  (Ethernet)
        RX packets 703  bytes 65063 (65.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 146  bytes 19511 (19.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Not sure why the second IP doesn't show in ifconfig -a, as it does come up in ip a:

$ ip a
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:35:97:95 brd ff:ff:ff:ff:ff:ff
    inet 149.210.175.202/24 brd 149.210.175.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 149.210.176.154/24 brd 149.210.176.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2a01:7c8:aab3:44f:5054:ff:fe35:9795/64 scope global dynamic mngtmpaddr 
       valid_lft 2591978sec preferred_lft 604778sec
    inet6 fe80::5054:ff:fe35:9795/64 scope link 
       valid_lft forever preferred_lft forever

I'm not using netplan.io because a blog post described it as buggy, which made me suspect it as the cause. However, the exact same problem occurred with netplan.io. I've also had ufw disabled for a while, and that made no difference.

The hosting company has been kind enough to help me troubleshoot but to no avail. They have even migrated the VPS to another hypervisor, which made no difference. As a last resort, I have upgraded the kernel all the way to 5.1.8-050108-generic.

What can I do to find out more about what might be causing the intermittent outages?

jorisw
  • During the time when you cannot connect to the secondary IP, can your VPS ping its secondary gateway? – Joel C Jun 10 '19 at 22:10
  • What does a tcpdump look like on that interface if you ping or connect to the system when it is in the 'broken' state? Do you see incoming packets? Do you see any replies? – Zoredache Jun 10 '19 at 22:56
  • "Not sure why the second IP doesn't show in ifconfig -a" -- because `ifconfig` is awful, and should not be used. – womble Jun 11 '19 at 03:26
  • @JoelC `route -n` lists `0.0.0.0` as the gateway for both IPs, and `149.210.175.1` as the gateway for destination `0.0.0.0`. I cannot ping `149.210.175.1` from the machine, even though the primary IP is up, and I am SSHed into the machine through that. I can ping `149.210.175.1` from the outside. – jorisw Jun 11 '19 at 07:10
  • @Zoredache A `tcpdump host 149.210.176.154` only shows this line repeated: `09:13:23.171521 ARP, Request who-has my.domain tell 149.210.176.1, length 46` - No sign of the pings I'm sending. – jorisw Jun 11 '19 at 07:14
  • Have you confirmed all firewalls are disabled or turned off? – Joel C Jun 11 '19 at 22:40

1 Answer

The best time to troubleshoot is while the issue is occurring.

  1. Check the kernel logs with dmesg or journalctl -k.
  2. Run ip monitor to see which events and changes occur.
  3. Ping the gateway, and some hosts behind it, by IP address from both of your addresses.
  4. Check the ARP table (ip n ls). The gateway's entry should be REACHABLE.
  5. Run tcpdump and write the traffic to a file.
  6. Analyze the traffic dump with Wireshark. This is the step most likely to reveal the cause of the issue.
  7. Analyzing the nstat -az output can also be very helpful, but only after the steps above.
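
The steps above could be scripted roughly as follows, so that everything is captured in one go while the secondary IP is down. This is only a sketch: the interface and addresses are the ones quoted in the question and its comments, while the function names (neigh_state, collect), the diag-* output directory, the 60-second capture window, and the "arp or icmp" filter are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: gather evidence while the secondary IP is unreachable.
# Interface and addresses come from the question; override via environment.
IFACE="${IFACE:-eth0}"
PRIMARY_IP="${PRIMARY_IP:-149.210.175.202}"
SECONDARY_IP="${SECONDARY_IP:-149.210.176.154}"
GATEWAY="${GATEWAY:-149.210.175.1}"

# Read `ip neigh` output on stdin and print the state (last field)
# of the entry whose address matches $1.
neigh_state() {
    awk -v addr="$1" '$1 == addr { print $NF }'
}

# Hypothetical helper: run steps 1-5 and 7, writing results to a new directory.
collect() {
    out="diag-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out" || return 1

    journalctl -k --no-pager > "$out/kernel.log" 2>&1         # step 1
    timeout 60 ip monitor > "$out/monitor.txt" 2>&1 &          # step 2
    timeout 60 tcpdump -n -i "$IFACE" -w "$out/capture.pcap" \
        'arp or icmp' 2> "$out/tcpdump.err" &                  # step 5

    # Step 3: ping the gateway from each source address in turn.
    ping -c 5 -I "$PRIMARY_IP" "$GATEWAY" > "$out/ping-primary.txt" 2>&1
    ping -c 5 -I "$SECONDARY_IP" "$GATEWAY" > "$out/ping-secondary.txt" 2>&1

    ip neigh show dev "$IFACE" > "$out/neigh.txt"              # step 4
    echo "gateway state: $(neigh_state "$GATEWAY" < "$out/neigh.txt")"
    nstat -az > "$out/nstat.txt"                               # step 7

    wait    # let the background ip monitor and tcpdump finish
    echo "results written to $out/"
}

# Only collect when explicitly asked to, e.g. `sh diag.sh run` as root.
if [ "${1:-}" = "run" ]; then
    collect
fi
```

Run it as root with the run argument during an outage, then open capture.pcap in Wireshark for step 6. The -n flag keeps tcpdump from resolving addresses to names, so ARP requests show the actual IP being asked for.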
Anton Danilov
  • Thank you. 1 - The kernel log doesn't show anything around the time the IP went down. 2 - Shows lines for the gateway, ending in `STALE`, `PROBE`, `REACHABLE`. 3 - The gateway responds to ICMP from the outside but not the inside, even though the primary IP is working. 4 - It's `REACHABLE`. – jorisw Jun 11 '19 at 07:20
  • 5 - So far only shows `ARP, Request who-has my.domain tell 149.210.176.1, length 46` type lines. – jorisw Jun 11 '19 at 07:26
  • But, as I understand it, you don't see any ARP replies. Is the `my.domain` address your secondary IP address? It's better to use the `-n` option of `tcpdump` to avoid reverse name resolution. – Anton Danilov Jun 11 '19 at 08:01