I used kubeadm to deploy a bare-metal cluster with one control plane node and one worker node on the same LAN. After initializing the cluster (kubeadm init on the control plane and kubeadm join on the worker), I installed Calico via Helm. The calico-node and calico-kube-controllers pods never reach the Ready state. However, they appear to be functioning correctly, and if I manually run the commands that the liveness and readiness probes execute, I get the expected success responses. I may have a Calico-specific problem, but my immediate question is: what could cause this behavior with the readiness probes?
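For context, the bootstrap looked roughly like this (the CIDR, addresses, and join token are placeholders rather than my literal commands):

    # control plane node (pod CIDR matching Calico's default)
    sudo kubeadm init --pod-network-cidr=192.168.0.0/16

    # worker node, using the join command printed by kubeadm init
    sudo kubeadm join <control-plane-ip>:6443 --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>

    # Calico installed via the Tigera operator Helm chart
    helm repo add projectcalico https://docs.tigera.io/calico/charts
    helm install calico projectcalico/tigera-operator \
        --namespace tigera-operator --create-namespace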

The output of kubectl describe pod -n calico-system calico-node-xxxx:

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  5s (x7 over 43s)  kubelet  Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: , stderr: , exit code -1

The probe configuration from the calico-node-xxxx pod's YAML:

    readinessProbe:
      exec:
        command:
        - /bin/calico-node
        - -felix-ready
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    livenessProbe:
      failureThreshold: 3
      httpGet:
        host: localhost
        path: /liveness
        port: 9099
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10

When I run kubectl exec -n calico-system calico-node-xxxx -- /bin/calico-node -felix-ready && echo "$?", the exit code is 0, a success. Likewise, curl localhost:9099/liveness returns a 200 and the expected response (the exact checks are shown below). This is true even if I run the commands within a second of the pods being created, so I doubt it is a matter of failureThreshold, timeoutSeconds, etc. My understanding of how the exec command is actually invoked for readiness probes is shaky, so maybe an explanation of how it could differ from kubectl exec would point me in the right direction?
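For reference, the manual checks look roughly like this (pod name abbreviated; calico-node uses host networking, so the liveness endpoint is reachable on the worker node's localhost):

    # readiness probe command, run manually through the API server
    kubectl exec -n calico-system calico-node-xxxx -- /bin/calico-node -felix-ready && echo "$?"
    # prints 0, i.e. the probe command succeeded

    # liveness endpoint, queried directly on the worker node
    curl -i http://localhost:9099/liveness
    # returns 200 with the expected body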

Thanks.

1 Answer


It was a bit hard to track down, but the cause turned out to be this bug in CRI-O: https://github.com/cri-o/cri-o/issues/6184. I was running an outdated version of conmon from the Ubuntu repository; updating conmon fixed it.
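For anyone hitting the same symptom: the difference from kubectl exec is that the kubelet does not go through the API server for exec probes. It calls the container runtime's ExecSync CRI method directly on the node, and in CRI-O that path goes through conmon, which is why kubectl exec can succeed while the probe fails with EOF. Roughly how I confirmed and fixed it (package names and repositories depend on how CRI-O was installed, so treat this as a sketch):

    # reproduce the probe the way the kubelet runs it, i.e. through CRI-O
    sudo crictl ps --name calico-node -q                # get the container ID
    sudo crictl exec <container-id> /bin/calico-node -felix-ready; echo $?
    # with the broken conmon this can fail with the same EOF the kubelet reports

    # check which conmon version CRI-O is using
    conmon --version

    # upgrade conmon from the same repository the CRI-O packages come from,
    # then restart CRI-O and the kubelet
    sudo apt-get update && sudo apt-get install --only-upgrade conmon
    sudo systemctl restart crio kubelet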