
The stage: a control plane machine running Kubernetes 1.24.3 on bare-metal Ubuntu 22.04, installed with kubeadm, plus one worker node. The whole set-up worked like a charm for 4 months, until some unknown silent kaboom yesterday (I don't exclude a sudden hardware issue).

The problem: port 6443 is listed by netstat for the first few minutes after the control plane machine starts up, and then disappears. Even while the port is open, the apiserver is unresponsive: any connection attempt to it is reset by the peer. In other words, there must be some serious problem on the kube-apiserver side, but I can't figure out what it's unhappy with.
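
For completeness, this is roughly how I observed that (the exact commands are just an illustration):

# Port 6443 shows up for the first few minutes after boot, then disappears
netstat -tlnp | grep 6443

# Even while it is listed, a connection attempt is reset by the peer
curl -k https://127.0.0.1:6443/healthz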

I checked some obvious things: the IP address didn't change, there is enough disk space, and the Kubernetes certificates are not expired. So I need to check the kube-apiserver logs somehow.
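
Roughly the checks I ran (a minimal sketch, assuming a standard kubeadm layout):

# Has the node's IP address changed?
ip addr show

# Enough disk space on the root filesystem?
df -h /

# Are the cluster certificates still valid?
kubeadm certs check-expiration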

As for the logs, the official page says:

On systemd-based systems, you may need to use journalctl instead of examining log files.

But... which component should I run journalctl for? If I run it for the kubelet (journalctl -u kubelet), I don't see many logs related to the apiserver apart from "can't connect to :6443".
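
For what it's worth, this is approximately what I ran (the grep filter is just an illustration):

# Recent kubelet logs, filtered for apiserver-related lines
journalctl -u kubelet --since "1 hour ago" --no-pager | grep -i apiserver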

And I don't see any service named kube-apiserver (or anything similar) when I run systemctl... Also, there are no logs in /var/log/ (not surprising, since it's a systemd-based system, but I checked nevertheless).
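
Concretely, something like this, and nothing kube-apiserver-related turns up:

# No kube-apiserver (or similar) unit exists
systemctl list-units --type=service | grep -i kube

# And no apiserver log files under /var/log/
ls /var/log/ | grep -i kube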

Is there a way to check the apiserver's logs, or is there some gotcha that I'm missing? I would appreciate any help on this subject!

Mikha
  • @larsks uhhuh! I use containerd, so got it... that's a good lead, thanks. I will try it on Monday and will write the result down here. Thanks! – Mikha Oct 23 '22 at 01:05
  • @larsks You were absolutely right. Looking into the pod logs on the control plane helped. TL;DR: etcd data corruption. In the logs of the kube-apiserver pod I saw this: "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Port 2379 turns out to be an etcd port, and the etcd pod's log shows: "error":"walpb: crc mismatch". There is enough disk space, so it might be a bug somewhere, but more likely a HW issue, as noted e.g. here: https://groups.google.com/g/coreos-dev/c/fRm2o58_N1U?pli=1 As I understand it, I can't recover the etcd data, so reinstall it is. (A sketch of getting at the etcd logs with crictl is below the comments.) – Mikha Oct 24 '22 at 00:34
  • @larsks if you put what you already wrote as an answer, I'll be happy to mark it as an accepted one. Thanks! – Mikha Oct 24 '22 at 00:36
  • Glad it helped! I've moved my comment to an answer. – larsks Oct 24 '22 at 00:39
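
For reference, a minimal sketch of how the etcd logs mentioned in the comments can be reached with crictl on the control plane node, assuming a containerd-based kubeadm setup (container names and IDs will differ):

# Find the etcd container (including exited ones, in case it is crash-looping)
crictl ps -a --name etcd -q

# Tail its logs using the ID from the previous command
crictl logs <etcd-container-id> |& tail -20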

1 Answer


On my clusters, kube-apiserver runs in a pod (and logs are available via kubectl logs).
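
When the apiserver is reachable, that looks something like this (the pod name follows the kube-apiserver-<node-name> convention kubeadm uses for its static pods; substitute your control plane node's name):

kubectl -n kube-system logs kube-apiserver-infra-control-plane | tail -20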

If the apiserver was down, I would log into a node directly and use the container runtime to examine the container logs (e.g., crictl logs or docker logs or whatever is appropriate for your system).

For example:

# Find the kube-apiserver pod
root@infra-control-plane:/# crictl pods | grep kube-apiserver
ca01411d447a6       3 days ago          Ready               kube-apiserver-infra-control-plane                  kube-system          0                   (default)
# Find container in that pod
root@infra-control-plane:/# crictl ps --pod ca01411d447a6
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
44da6bf198244       400e6a4878256       3 days ago          Running             kube-apiserver      0                   ca01411d447a6       kube-apiserver-infra-control-plane
# Look at logs for that container
root@infra-control-plane:/# crictl logs 44da6bf198244 |& tail -2
Trace[1944612718]: ---"Writing http response done" 1254ms (21:59:45.444)
Trace[1944612718]: [1.255113794s] [1.255113794s] END
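
If you'd rather not copy container IDs around, a one-liner along these lines should also work (this assumes crictl's --name filter and -q flag, and simply picks the first matching container):

# Tail the apiserver container logs without copying the ID by hand
crictl logs $(crictl ps -a --name kube-apiserver -q | head -1) |& tail -20
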
larsks