In our church we get our Internet uplink from a friendly neighbor’s WiFi. At first everything seemed fine (at least), but over the past few weeks general “Internet stability problems” have increased, and for approx. 2 weeks my Raspberry Pi on site has been unable to establish an SSH connection to the Internet.
I don’t understand the root cause yet but the workaround is: set the MTU to 1452 on the client.
I would prefer not to decrease the MTU for the whole network and I’d very much like to understand what’s actually going on.
The overall question is: why does setting the MTU to 1452 solve the Internet connection problems, and how can the root cause be fixed?
Here is kind of a network graph:
Vodafone -------------------- Kabelbox ) ) )    ( ( ( NanoStation loco m5 -------------------- Linux client
          DS-Lite over PPPoE           192.168.0.0/24                       192.168.157.0/24
Default MTU:
                 1452                       2286                                  1500
From left to right
- Vodafone ISP
- DS-Lite over PPPoE via TV cable
- "Kabelbox" (German marketing name for cable box, combined cable modem + router + WiFi AP)
- WiFi 5GHz link (Kabelbox provides a separate “Backbone” SSID)
- Ubiquiti NanoStation loco m5 (WiFi client, IPv4 router, OS based on Linux 2.6.32.71)
- Ethernet cable
- Linux client (routinely a switch + Raspberry Pi, for debugging direct connection to laptop with Debian Stretch)
This was pretty hard to debug because the problem is not deterministic and at least seems not to be permanent. I have not yet found a minimal test to demonstrate the issue.
Symptoms
- Raspberry Pi couldn’t establish SSH connection for many days
- Raspberry Pi couldn’t send system management emails for many days
- DNS resolution sometimes failed
- web pages sometimes didn’t load (even the same URLs sometimes did, then didn’t)
- Signal desktop and Android app wouldn’t connect/update (actually the most reliable test I found)
All issues above are timeout issues. The programs would try but never receive a response, because either request or response packets/frames were dropped.
My research
After some poking around in the dark I found this is probably MTU related. Setting the MTU to 1200 by chance removed the symptoms.
I’ve inspected the default MTU values:
- ip link on the Linux client showed 1500 (pretty standard in my experience)
- ip link on the NanoStation showed 1500 for LAN and 2268 for WiFi (that is okay for 802.11)
- traceroute --mtu www.daniel-boehmer.de showed 1452 between Kabelbox and Vodafone (this seems to be the default for DS-Lite over PPPoE)
To my knowledge both the NanoStation and the Kabelbox routers should transparently fragment packets larger than the next hop’s MTU. I can prove the Kabelbox at least sometimes does that properly (commands run on the NanoStation):
XW.v6.3.2-cs.33267.200715.1627# ip link set mtu 1500 dev ath0
XW.v6.3.2-cs.33267.200715.1627# ping www.daniel-boehmer.de -c1 -s 1472
PING www.daniel-boehmer.de (185.142.180.110): 1472 data bytes
1480 bytes from 185.142.180.110: seq=0 ttl=52 time=38.633 ms
--- www.daniel-boehmer.de ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 38.633/38.633/38.633 ms
I am also pretty sure the NanoStation fragments properly. With DF cleared the ping goes through, and it returns an error if DF is set:
root@linux:~# ip li set mtu 1500 dev wlp3s0
root@linux:~# ping www.daniel-boehmer.de -c1 -s 1472 -M dont
PING www.daniel-boehmer.de (185.142.180.110) 1472(1500) bytes of data.
1480 bytes from lists.christallin.net (185.142.180.110): icmp_seq=1 ttl=51 time=38.8 ms
--- www.daniel-boehmer.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 38.756/38.756/38.756/0.000 ms
root@linux:~# ping www.daniel-boehmer.de -c1 -s 1472 -M do
PING www.daniel-boehmer.de (185.142.180.110) 1472(1500) bytes of data.
From 192.168.157.1 (192.168.157.1) icmp_seq=1 Frag needed and DF set (mtu = 1452)
--- www.daniel-boehmer.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
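For reference, the -s 1472 used in the pings above follows from simple header arithmetic: the ICMP payload is the MTU minus the IPv4 header and the ICMP header. A minimal sketch, assuming a plain 20-byte IPv4 header without options (the helper name icmp_payload is my own, not part of any tool):

```shell
#!/bin/sh
# Largest ICMP echo payload that fits into one unfragmented IPv4 packet:
# MTU minus 20 bytes IPv4 header (assumed: no IP options) minus 8 bytes ICMP header.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

icmp_payload 1500   # payload for a full 1500-byte packet, i.e. the -s value used above
icmp_payload 1452   # payload for a packet that exactly fills the reported MTU of 1452
```

If the path MTU really is 1452, then ping -c1 -M do -s 1424 should pass unfragmented, while -s 1425 should trigger the "Frag needed" error again.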
What causes the problems then?
I tried setting the MTU to as low as 1200 on the NanoStation for all interfaces but this made no difference to the Linux client, e.g. Signal wouldn’t connect.
As soon as I set the MTU to 1452 on the client, Signal does connect. All symptoms gone.
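For completeness, the client-side workaround can be sketched as below. The helper name set_client_mtu is hypothetical, and wlp3s0 is simply the laptop interface from the session above, so the name will differ on the Raspberry Pi. The command needs root and the setting does not survive a reboot.

```shell
#!/bin/sh
# Hypothetical helper: set the MTU on one client interface and show the result.
# $1 = interface name (e.g. wlp3s0 on the laptop), $2 = MTU, defaulting to 1452.
set_client_mtu() {
    dev="$1"
    mtu="${2:-1452}"
    ip link set dev "$dev" mtu "$mtu" && ip link show dev "$dev"
}
```

Calling set_client_mtu wlp3s0 as root applies the 1452 default; set_client_mtu wlp3s0 1500 reverts to the previous value.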
I am confused. (I hope my notes are still comprehensible.)
Why does the MTU need to be set on the client, and why does neither router fragment transparently?