I am building a Kubernetes cluster using kubeadm and have an issue with a single node.
The worker nodes are running with sub-interfacing and policy-based routing, which work as intended; however, out of the 4 worker nodes, if pods are moved onto one particular node, they fail their HTTP liveness and readiness checks.
I am using Kubernetes version 1.26.1, calico 3.25.0, metallb 0.13.9, and ingress-nginx 4.5.0.
The cluster stood up with little issue, aside from getting the policy-based routing on the nodes worked out. Calico and MetalLB came up and work as well.
The issue appears when I stand up the ingress-nginx controllers and force the pods onto a specific worker node. When they run on the other nodes, they come up and I can curl the LoadBalancer IP; however, when the ingress-nginx pods are moved to this one node, the liveness and readiness checks fail. Moving the pods back to any other worker node, they come up and run just fine.
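For reference, this is roughly how I'm pinning the controller to a single node while testing (the release name and the node name `worker-4` are placeholders for my actual values; the chart's `controller.nodeSelector` value does the pinning):

```
# Pin the ingress-nginx controller pods to one worker node for testing.
# "worker-4" stands in for the problem node's hostname.
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set 'controller.nodeSelector.kubernetes\.io/hostname=worker-4'
```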
I've been verifying the routes and iptables rules on all the nodes, as well as watching the interfaces via tcpdump, but I haven't narrowed down the issue.
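The checks I've been running on each node look roughly like this (the pod IP shown is a placeholder, not an exact value from my cluster):

```
# Compare the policy routing rules and route tables between a good node and the bad one
ip rule show
ip route show table all

# Dump what kube-proxy and Calico have programmed into iptables
iptables-save > /tmp/iptables-$(hostname).txt

# Watch all interfaces while a probe or curl is in flight
# (10.244.1.23 is a placeholder for the pod IP)
tcpdump -ni any host 10.244.1.23
```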
To rule out the simple things:
- Kernel parameters and loaded modules are the same between the nodes (compared roughly as shown after this list)
- No logs in messages or cri-o are showing an issue with starting the pods
- The Calico and MetalLB pods on the problem node are running fine
- I've rebuilt the cluster since noticing the issue; in prior builds, cert-manager had the same problem on that node, as did a few other random test deployments I tried
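For the comparisons above, this is more or less what I've been doing (node names are placeholders, and the namespaces depend on how Calico and MetalLB were installed):

```
# Diff sysctl settings and loaded modules between a known-good node and the problem node
ssh worker-1 'sysctl -a | sort' > worker-1.sysctl
ssh worker-4 'sysctl -a | sort' > worker-4.sysctl
diff worker-1.sysctl worker-4.sysctl

ssh worker-1 'lsmod | sort' > worker-1.lsmod
ssh worker-4 'lsmod | sort' > worker-4.lsmod
diff worker-1.lsmod worker-4.lsmod

# Confirm the Calico and MetalLB pods scheduled on the problem node are healthy
kubectl get pods -n calico-system -o wide --field-selector spec.nodeName=worker-4
kubectl get pods -n metallb-system -o wide --field-selector spec.nodeName=worker-4
```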
Some more detail from testing:
- From within the pods while they are running, I can hit external sites via curl (DNS and outbound traffic work).
- Using tcpdump on the 'any' interface of the problem node, I can see the pod and the Kubernetes internal API IP communicating.
- I can't hit the pod's IP, the service IP, or anything else from the problem node or from another member node.
- The namespace events aren't showing any issues except for the liveness and readiness probes failing.
- The endpoints for the services aren't being populated while the pods are on the problem node (although this isn't a surprise).
- Watching the traffic over the vxlan.calico interface isn't showing only one-way traffic; there are responses to the traffic that is making it through.
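The tests behind those observations look roughly like this (IPs are placeholders; the probe port 10254 and path /healthz are the chart's defaults for the controller health endpoint):

```
# Find the pod and service IPs
kubectl get pods -n ingress-nginx -o wide
kubectl get svc -n ingress-nginx

# From the problem node or another worker, try the pod and service directly
curl -v --max-time 5 http://10.244.1.23:10254/healthz   # controller health endpoint
curl -v --max-time 5 http://10.96.120.15/

# Check whether the service endpoints ever get populated
kubectl get endpoints -n ingress-nginx

# Watch the VXLAN overlay while the probes run
tcpdump -ni vxlan.calico host 10.244.1.23
```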
I'm at a loss on where to look for the root issue. This has been going on for over a week and I could use some help.