We are having trouble with network routing configuration in Ubuntu Xenial.
We have many servers, some running Debian 8.4 (Jessie) and some running Ubuntu 16.04.2 (Xenial), all with the exact same networking setup (at least as far as we can see).
They all have two NICs attached to two VLANs (say "A" and "B"), both reachable through other VLANs, for example from VLAN "C".
Both /etc/network/interfaces files are of the form below (NOTE: I faked names and IPs for the sake of readability):
# VLAN A
auto eth0
iface eth0 inet static
address 192.168.111.xxx
netmask 255.255.255.0
broadcast 192.168.111.255
network 192.168.111.0
gateway 192.168.111.254
dns-nameservers 192.168.111.25 192.168.111.26
# VLAN B
auto eth1
iface eth1 inet static
address 192.168.222.xxx
netmask 255.255.255.0
broadcast 192.168.222.255
network 192.168.222.0
gateway 192.168.222.254 # <-- (Commented out in Ubuntu machine)
dns-nameservers 192.168.111.25 192.168.111.26
...where xxx is 100 for the Debian machine and 200 for the Ubuntu machine. I'm trying to ping from 192.168.1.10 in VLAN "C" to the following addresses:
- 192.168.111.100: Works fine.
- 192.168.222.100: Works fine.
- 192.168.111.200: Works fine.
- 192.168.222.200: NO Answer!!
The "B" vlan is used mostly for backup and other "background" traffic to avoid saturation problems in vlan "A".
I know that having two network paths to the same machine is not a usual setup, and I must say that only being able to connect through one of them from other networks is not a big problem for now. But what puzzles me is: why can I reach the Debian machines and not the Ubuntu ones?
On the other hand, if it worked on both platforms, we could consider closing some services (such as ssh and backend interfaces) on NIC "A" to improve security (our firewall only allows access to VLAN "B" from our IT staff VLAN).
Of course, as noted in the interfaces snippet above, the gateway line is commented out on the Ubuntu machines, but that is because networking initialization fails on those machines otherwise. That is, in fact, what we are trying to solve.
But both machines' routing tables are almost identical. The only difference I could see was the onlink flag on the Ubuntu machine:
myUser@debianMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.100
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.100
myUser@ubuntuMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0 onlink
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.200
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.200
...but I was able to remove it with the following command:
myUser@ubuntuMachine:~$ sudo ip route replace default via 192.168.111.254 dev eth0
myUser@ubuntuMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.200
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.200
And it didn't fix the problem.
After that, I also tried to uncomment the gateway line of 'VLAN B' (which, as I said, was commented out in the /etc/network/interfaces file) and restart networking, but this is what happened:
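Since the two tables look the same, a read-only way to see what the kernel would actually decide is to ask it directly. This is only a sketch of the checks I have in mind, using the faked addresses from above. The helper name `show` is mine, and nothing here changes the routing configuration.

```shell
#!/bin/sh
# Read-only routing checks for the Ubuntu machine (addresses are the faked ones).
CLIENT=192.168.1.10      # host in VLAN "C" that sends the ping
B_ADDR=192.168.222.200   # this machine's address on VLAN "B"
B_DEV=eth1               # NIC attached to VLAN "B"

show() {
    # Print the command, then run it if the tool is installed.
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" || echo "(exit status $?)"
    fi
}

routing_checks() {
    # Which route would a reply sourced from the VLAN B address take?
    show ip route get "$CLIENT" from "$B_ADDR"
    # How does the kernel treat a packet from the client arriving on eth1?
    show ip route get "$B_ADDR" from "$CLIENT" iif "$B_DEV"
    # Which policy rules are consulted before the main table?
    show ip rule show
}

routing_checks
```

Running the same three commands on the Debian machine (with 192.168.222.100) should make any difference in the kernel's decision visible, even when `ip route` prints the same table.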
myUser@ubuntuMachine:~$ sudo /etc/init.d/networking restart
[....] Restarting networking (via systemctl): networking.serviceJob for networking.service failed because the control process exited with error code. See "systemctl status networking.service" and "journalctl -xe" for details.
failed!
...and the onlink flag came back again.
As a note, after commenting out that line again and issuing a new /etc/init.d/networking restart command, the output stays the same until the machine is rebooted (although networking, apart from the VLAN B default gateway issue, continues working as usual).
The following is the output of the suggested commands:
myUser@ubuntuMachine:~$ sudo systemctl status networking.service
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
Drop-In: /run/systemd/generator/networking.service.d
└─50-insserv.conf-$network.conf
Active: failed (Result: exit-code) since jue 2017-12-21 14:55:29 CET; 42s ago
Docs: man:interfaces(5)
Process: 8552 ExecStop=/sbin/ifdown -a --read-environment --exclude=lo (code=exited, status=0/SUCCESS)
Process: 8940 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 8934 ExecStartPre=/bin/sh -c [ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-envi
Main PID: 8940 (code=exited, status=1/FAILURE)
dic 21 14:55:29 ubuntuMachine systemd[1]: Stopped Raise network interfaces.
dic 21 14:55:29 ubuntuMachine systemd[1]: Starting Raise network interfaces...
dic 21 14:55:29 ubuntuMachine ifup[8940]: RTNETLINK answers: File exists
dic 21 14:55:29 ubuntuMachine ifup[8940]: Failed to bring up eth1.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILUR
dic 21 14:55:29 ubuntuMachine systemd[1]: Failed to start Raise network interfaces.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Unit entered failed state.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Failed with result 'exit-code'.
...and the meaningful part of sudo journalctl -xe:
dic 21 14:55:29 ubuntuMachine sudo[8922]: myUser : TTY=pts/0 ; PWD=/home/myUser ; USER=root ; COMMAND=/etc/init.d/networking restart
dic 21 14:55:29 ubuntuMachine sudo[8922]: pam_unix(sudo:session): session opened for user root by myUser(uid=0)
dic 21 14:55:29 ubuntuMachine systemd[1]: Stopped Raise network interfaces.
-- Subject: Unit networking.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has finished shutting down.
dic 21 14:55:29 ubuntuMachine systemd[1]: Starting Raise network interfaces...
-- Subject: Unit networking.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has begun starting up.
dic 21 14:55:29 ubuntuMachine ifup[8940]: RTNETLINK answers: File exists
dic 21 14:55:29 ubuntuMachine ifup[8940]: Failed to bring up eth1.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
dic 21 14:55:29 ubuntuMachine systemd[1]: Failed to start Raise network interfaces.
-- Subject: Unit networking.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has failed.
--
-- The result is failed.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Unit entered failed state.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Failed with result 'exit-code'.
dic 21 14:55:29 ubuntuMachine sudo[8922]: pam_unix(sudo:session): session closed for user root
I googled a lot hoping to find related information, but nothing fully answers my question:
An explanation of the "onlink" flag, which seemed to point to the possibility that the flag was responsible for wrong return routing, in the sense that it «tells the kernel that it does not have to check if the gateway is reachable directly by the current machine». So (I figured) the kernel might think it could (or should) route the answers to incoming connections from VLAN C to the default gateway instead of through the same NIC on which the connection arrived.
- But, as I said, removing the "onlink" flag didn't seem to change anything.
This unix StackExchange answer seems to solve the problem (I haven't tested it yet) by using multiple routing tables and rules (to tell the kernel which table to use). But it doesn't explain why the Debian machines work. I checked the /etc/iproute2/rt_tables file on both machines and they are identical too:
myUser@bothMachines:~$ sudo cat /etc/iproute2/rt_tables
#
# reserved values
#
255 local
254 main
253 default
0 unspec
#
# local
#
#1 inr.ruhep
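For reference, this is roughly how I understand the two-table approach would look on the Ubuntu machine. It is an untested sketch and not a copy of that answer: table id 200 is an arbitrary number I picked because it is not listed in rt_tables, the addresses are the faked ones from above, and the `run` helper is mine. By default it only prints the commands, so it cannot cut off a remote session by accident.

```shell
#!/bin/sh
# Untested sketch: source-based policy routing for VLAN "B".
# Prints the commands by default; set APPLY=1 to really run them (as root).
B_ADDR=192.168.222.200   # this machine's address on VLAN "B"
B_NET=192.168.222.0/24   # VLAN "B" subnet
B_GW=192.168.222.254     # VLAN "B" gateway
B_DEV=eth1               # NIC attached to VLAN "B"
B_TABLE=200              # assumption: any table id not already in use

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

vlan_b_policy_routing() {
    # Extra table: the link route plus a default route of its own.
    run ip route add "$B_NET" dev "$B_DEV" src "$B_ADDR" table "$B_TABLE"
    run ip route add default via "$B_GW" dev "$B_DEV" table "$B_TABLE"
    # Packets sourced from the VLAN B address consult that table.
    run ip rule add from "$B_ADDR" table "$B_TABLE"
}

vlan_b_policy_routing
```

With this in place the main table keeps its single default route through eth0, so no second gateway line would be needed for eth1. Commands issued this way do not survive a reboot, so they would still have to be hooked into the interface configuration somehow.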
So my final hypothesis is that it could just be an implementation difference between kernel versions. Given that the Ubuntu kernel is much more recent, this could be the correct behaviour, so that on modern kernels I need to use two different routing tables (but I'm not sure, and I don't know why...).
myUser@debianMachine:~$ sudo uname -a
Linux debianMachine 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
myUser@ubuntuMachine:~$ sudo uname -a
Linux ubuntuMachine 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
And, hence, the question is:
Are we doing something wrong on the Ubuntu machines (or is there some bug in them)? Or, conversely, is this the correct behaviour, and are we forced to set up a more complex routing scheme (either per-VLAN routes or two routing tables) to make two default gateways work again?
EDIT:
I have now tried adding a static route to fix the problem:
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1.0/24 via 192.168.222.254 dev eth1
...but that froze my ssh connection (through NIC A), even though I could then connect through NIC B (at 192.168.222.200).
Having both routes at the same time does not seem to be possible:
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1/24 via 192.168.111.254 dev eth0
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1/24 via 192.168.222.254 dev eth1
RTNETLINK answers: File exists
EDIT 2:
I finally found the Linux Advanced Routing & Traffic Control HOWTO, which seems to be more accurate than all the other documentation I found. Specifically, in its Chapter 4, "Rules - routing policy database", I see the following text:
If you want to use this feature, make sure that your kernel is compiled with the "IP: advanced router" and "IP: policy routing" features
...so I think everything points to my previous hypothesis of a kernel implementation difference being right, and that the difference is specifically whether those two features are compiled in.
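A quick way to test that hypothesis would be to compare the build configuration of both kernels. The two menu entries quoted above correspond to the Kconfig symbols CONFIG_IP_ADVANCED_ROUTER and CONFIG_IP_MULTIPLE_TABLES. The sketch below assumes the distribution ships the kernel configuration as /boot/config-<release>, and the function name is mine.

```shell
#!/bin/sh
# Report how the two policy-routing options are set in a kernel config file.
# Assumption: the running kernel's config is at /boot/config-$(uname -r).
check_policy_routing_config() {
    cfg="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 2
    fi
    # Matches "CONFIG_X=y" as well as "# CONFIG_X is not set".
    grep -E '^(# )?CONFIG_IP_(ADVANCED_ROUTER|MULTIPLE_TABLES)[= ]' "$cfg"
}

check_policy_routing_config || true
```

If both machines report the two options as =y, the difference would have to be somewhere else than in the compiled-in features.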