I have been investigating the root cause of a hairy routing issue on a CentOS 7 cluster...
Behavior:
- TCP packets from a Docker container reach targets outside of the cluster, but the response packets do not reach the container that is waiting for them.
- iptables logging strongly indicates that the "routing decision" (in iptables speak) causes this problem. More precisely: response packets still exist at stage "mangle PREROUTING" but are missing at stage "mangle FORWARD/INPUT".
- Playing around with "ip route get" results in:
## Check route from container to service host outside of cluster
ip route get to 172.17.27.1 from 10.233.70.32 iif cni0
## Works just fine, as mentioned. Result:
# 172.17.27.1 from 10.233.70.32 dev ens192
# cache iif cni0
## Check route from service host outside of cluster back to container
ip route get to 10.233.70.32 from 172.17.27.1 iif ens192
## Does not work. Error Msg:
# RTNETLINK answers: No route to host
- At that point I was pretty sure that there must be a misconfigured route somewhere in the routing table. Command "ip route list" gives:
default via 172.17.0.2 dev ens192 proto static
10.233.64.0/24 via 10.233.64.0 dev flannel.1 onlink
10.233.65.0/24 via 10.233.65.0 dev flannel.1 onlink
10.233.66.0/24 via 10.233.66.0 dev flannel.1 onlink
10.233.67.0/24 via 10.233.67.0 dev flannel.1 onlink
10.233.68.0/24 via 10.233.68.0 dev flannel.1 onlink
10.233.69.0/24 via 10.233.69.0 dev flannel.1 onlink
10.233.70.0/24 dev cni0 proto kernel scope link src 10.233.70.1 # this is the local container network
10.233.71.0/24 via 10.233.71.0 dev flannel.1 onlink
172.17.0.0/18 dev ens192 proto kernel scope link src 172.17.31.118
192.168.1.0/24 dev docker0 proto kernel scope link src 192.168.1.5 linkdown
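For reference, the iptables logging mentioned above can be set up roughly as follows. This is only a minimal sketch, not the exact rules used: the function name `trace_rules`, the `IPT` variable and the log prefixes are made up for illustration, and the match is limited to TCP packets addressed to the container IP from this post.

```shell
#!/bin/sh
# Sketch: LOG rules showing at which mangle chain the response packets vanish.
# CONTAINER_IP comes from the post; IPT, trace_rules and the prefixes are illustrative.
CONTAINER_IP="${CONTAINER_IP:-10.233.70.32}"
IPT="${IPT:-iptables}"   # set IPT="echo iptables" for a dry run without root

# $1 is -I (insert the rules at the top of each chain) or -D (delete them again)
trace_rules() {
    for chain in PREROUTING INPUT FORWARD; do
        $IPT -t mangle "$1" "$chain" -d "$CONTAINER_IP" -p tcp \
            -j LOG --log-prefix "mangle-$chain: "
    done
}

# trace_rules -I    # start logging (needs root); entries land in the kernel log
# trace_rules -D    # remove the rules afterwards
```

A packet that shows up with the `mangle-PREROUTING: ` prefix but with neither of the other two was dropped between PREROUTING and INPUT/FORWARD, which is where the routing decision is made.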
Although I couldn't find any error in the routes above, it gets even more confusing when comparing with a second cluster that was configured using the same Ansible scripts. Output of the healthy cluster:
- "ip route get":
## Check route from container to service host outside of cluster
ip route get to 172.17.27.1 from 10.233.66.2 iif cni0
## Works:
# 172.17.27.1 from 10.233.66.2 dev eth0
# cache iif cni0
## Check route from service host outside of cluster back to container
ip route get to 10.233.66.2 from 172.17.27.1 iif eth0
## Worked! But why when using same rules as unhealthy cluster above? - please see below:
# 10.233.66.2 from 172.17.27.1 dev cni0
# cache iif eth0
- "ip route list":
default via 172.17.0.2 dev eth0 proto dhcp metric 100
10.233.64.0/24 via 10.233.64.0 dev flannel.1 onlink
10.233.65.0/24 via 10.233.65.0 dev flannel.1 onlink
10.233.66.0/24 dev cni0 proto kernel scope link src 10.233.66.1 # this is the local container network
10.233.67.0/24 via 10.233.67.0 dev flannel.1 onlink
172.17.0.0/18 dev eth0 proto kernel scope link src 172.17.43.231 metric 100
192.168.1.0/24 dev docker0 proto kernel scope link src 192.168.1.5 linkdown
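Since the main routing tables of both hosts look equivalent, the other inputs to the kernel's routing decision could be compared as well: policy rules, the local table, and the per-interface forwarding and rp_filter sysctls. As far as I understand, a lookup with "iif" for a non-local destination can fail when forwarding is disabled on the incoming interface, even if a matching route exists. A minimal sketch, where the function name `routing_state` and the output file names are made up for illustration:

```shell
#!/bin/sh
# Sketch: dump routing-decision inputs that "ip route list" does not show.
# The uplink name is replaced by UPLINK so the output of both hosts can be diffed.
routing_state() {
    uplink="${1:?usage: routing_state UPLINK_IFACE}"   # ens192 or eth0 in this post
    {
        ip rule list                  # policy rules, which select the tables to consult
        ip route show table local     # local and broadcast routes
        for key in net.ipv4.ip_forward \
                   net.ipv4.conf.all.forwarding \
                   "net.ipv4.conf.$uplink.forwarding" \
                   net.ipv4.conf.cni0.forwarding \
                   net.ipv4.conf.all.rp_filter \
                   "net.ipv4.conf.$uplink.rp_filter"; do
            sysctl "$key"             # prints "key = value"
        done
    } | sed "s/$uplink/UPLINK/g"
}

# routing_state ens192 > broken.txt     # on the unhealthy host
# routing_state eth0   > healthy.txt    # on the healthy host
# diff broken.txt healthy.txt           # after copying both files to one machine
```

Any line that differs between the two files, apart from the host addresses, would be a candidate explanation for why the same route list behaves differently.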
Any ideas or hints?
Thank you so much!