
I don't know how to debug this. I have one Kubernetes master node and three slave nodes. I deployed a Gluster cluster on the three slave nodes without problems, following this guide: https://github.com/gluster/gluster-kubernetes/blob/master/docs/setup-guide.md

I created volumes and everything was working. But when I reboot a slave node and it reconnects to the master node, glusterd.service inside that node shows up as inactive (dead) and nothing works from then on.

[root@kubernetes-node-1 /]# systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

I don't know where to go from here. For example, /var/log/glusterfs/glusterd.log was last updated 3 days ago; it is not being updated with errors after a reboot or after a pod deletion and recreation.

I just want to know where glusterd crashes so I can find out why.

How can I debug this crash?
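
For reference, these are the standard places I know to look on the rebooted node (generic systemd commands; I am assuming glusterd runs as a regular systemd unit inside the pod):

journalctl -u glusterd.service -b --no-pager    # unit messages since the current boot
systemctl status glusterd.service -l            # last exit status and start attempts
tail -n 100 /var/log/glusterfs/glusterd.log     # daemon log (last updated 3 days ago here)
ls -lt /var/log/glusterfs/                      # see which log files are actually being written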

All the nodes (master and slaves) run as Ubuntu Desktop 18 LTS 64-bit VirtualBox VMs.

Requested output (kubectl get all --all-namespaces):

NAMESPACE     NAME                                                 READY   STATUS              RESTARTS   AGE
glusterfs     pod/glusterfs-7nl8l                                  0/1     Running             62         22h
glusterfs     pod/glusterfs-wjnzx                                  1/1     Running             62         2d21h
glusterfs     pod/glusterfs-wl4lx                                  1/1     Running             112        41h
glusterfs     pod/heketi-7495cdc5fd-hc42h                          1/1     Running             0          22h
kube-system   pod/coredns-86c58d9df4-n2hpk                         1/1     Running             0          6d12h
kube-system   pod/coredns-86c58d9df4-rbwjq                         1/1     Running             0          6d12h
kube-system   pod/etcd-kubernetes-master-work                      1/1     Running             0          6d12h
kube-system   pod/kube-apiserver-kubernetes-master-work            1/1     Running             0          6d12h
kube-system   pod/kube-controller-manager-kubernetes-master-work   1/1     Running             0          6d12h
kube-system   pod/kube-flannel-ds-amd64-785q8                      1/1     Running             5          3d19h
kube-system   pod/kube-flannel-ds-amd64-8sj2z                      1/1     Running             8          3d19h
kube-system   pod/kube-flannel-ds-amd64-v62xb                      1/1     Running             0          3d21h
kube-system   pod/kube-flannel-ds-amd64-wx4jl                      1/1     Running             7          3d21h
kube-system   pod/kube-proxy-7f6d9                                 1/1     Running             5          3d19h
kube-system   pod/kube-proxy-7sf9d                                 1/1     Running             0          6d12h
kube-system   pod/kube-proxy-n9qxq                                 1/1     Running             8          3d19h
kube-system   pod/kube-proxy-rwghw                                 1/1     Running             7          3d21h
kube-system   pod/kube-scheduler-kubernetes-master-work            1/1     Running             0          6d12h

NAMESPACE     NAME                                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
default       service/kubernetes                                               ClusterIP   10.96.0.1        <none>        443/TCP         6d12h
elastic       service/glusterfs-dynamic-9ad03769-2bb5-11e9-8710-0800276a5a8e   ClusterIP   10.98.38.157     <none>        1/TCP           2d19h
elastic       service/glusterfs-dynamic-a77e02ca-2bb4-11e9-8710-0800276a5a8e   ClusterIP   10.97.203.225    <none>        1/TCP           2d19h
elastic       service/glusterfs-dynamic-ad16ed0b-2bb6-11e9-8710-0800276a5a8e   ClusterIP   10.105.149.142   <none>        1/TCP           2d19h
glusterfs     service/heketi                                                   ClusterIP   10.101.79.224    <none>        8080/TCP        2d20h
glusterfs     service/heketi-storage-endpoints                                 ClusterIP   10.99.199.190    <none>        1/TCP           2d20h
kube-system   service/kube-dns                                                 ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP   6d12h

NAMESPACE     NAME                                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
glusterfs     daemonset.apps/glusterfs                 3         3         0       3            0           storagenode=glusterfs             2d21h
kube-system   daemonset.apps/kube-flannel-ds-amd64     4         4         4       4            4           beta.kubernetes.io/arch=amd64     3d21h
kube-system   daemonset.apps/kube-flannel-ds-arm       0         0         0       0            0           beta.kubernetes.io/arch=arm       3d21h
kube-system   daemonset.apps/kube-flannel-ds-arm64     0         0         0       0            0           beta.kubernetes.io/arch=arm64     3d21h
kube-system   daemonset.apps/kube-flannel-ds-ppc64le   0         0         0       0            0           beta.kubernetes.io/arch=ppc64le   3d21h
kube-system   daemonset.apps/kube-flannel-ds-s390x     0         0         0       0            0           beta.kubernetes.io/arch=s390x     3d21h
kube-system   daemonset.apps/kube-proxy                4         4         4       4            4           <none>                            6d12h

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
glusterfs     deployment.apps/heketi    1/1     1            0           2d20h
kube-system   deployment.apps/coredns   2/2     2            2           6d12h

NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
glusterfs     replicaset.apps/heketi-7495cdc5fd    1         1         0       2d20h
kube-system   replicaset.apps/coredns-86c58d9df4   2         2         2       6d12h

Requested output:

tasos@kubernetes-master-work:~$ kubectl logs -n glusterfs glusterfs-7nl8l
env variable is set. Update in gluster-blockd.service
  • Start with kubectl get events. –  Feb 11 '19 at 10:45
  • I have done that; the only info there is that `/usr/local/bin/status-probe.sh` failed, and that's because it tries to retrieve the status of `glusterd.service`. – Tasos Feb 11 '19 at 10:53
  • Can you share the kubectl get all --all-namespaces output? –  Feb 11 '19 at 11:06
  • What's the output of: kubectl logs -n glusterfs glusterfs-7nl8l –  Feb 11 '19 at 11:27
  • @wrogrammer Updated the question (by the way, gluster-blockd.service also shows as inactive (dead)). – Tasos Feb 11 '19 at 11:30

1 Answer


Please check these similar topics:

GlusterFS deployment on k8s cluster-- Readiness probe failed: /usr/local/bin/status-probe.sh

and

https://github.com/gluster/gluster-kubernetes/issues/539

Check the tcmu-runner.log file to debug it.
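
For example, you can read it from inside the affected glusterfs pod (pod name taken from your output; the exact log path can differ between gluster-block versions, so locate it first):

kubectl exec -n glusterfs glusterfs-7nl8l -- find /var/log -name tcmu-runner.log
kubectl exec -n glusterfs glusterfs-7nl8l -- tail -n 100 /var/log/glusterfs/gluster-block/tcmu-runner.log
kubectl exec -n glusterfs glusterfs-7nl8l -- journalctl -u tcmu-runner -u gluster-blockd -b --no-pager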

UPDATE:

I think this will be your issue: https://github.com/gluster/gluster-kubernetes/pull/557

The PR is prepared, but not merged yet.

UPDATE 2:

https://github.com/gluster/glusterfs/issues/417

Be sure that rpcbind is installed.
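
A quick way to check from the master node (pod name from your output; the rpm query assumes the CentOS-based image that gluster-kubernetes deploys):

kubectl exec -n glusterfs glusterfs-7nl8l -- rpm -q rpcbind
kubectl exec -n glusterfs glusterfs-7nl8l -- systemctl status rpcbind.service --no-pager
kubectl exec -n glusterfs glusterfs-7nl8l -- systemctl start rpcbind.service   # if it is not running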

  • I have implemented those changes in my deployment (PR 557) and they have solved the related problems, but they do not solve the issue in the question. All those issues are related to `gluster-blockd.service`. My `tcmu-runner.log` shows no errors, only info messages. I have checked the issues linked in this answer, but as I said, for a different problem. – Tasos Feb 11 '19 at 12:02
  • Please check https://github.com/gluster/glusterfs/issues/417 and make sure you have a valid version of the rpcbind package. –  Feb 11 '19 at 12:12
  • Inside the Kubernetes containers? It was installed when they were deployed by Kubernetes; that's how the slave nodes were talking to each other before I started the reboots. – Tasos Feb 11 '19 at 12:18
  • Inside the pod: $ /etc/init.d/rpcbind start. After starting portmap or rpcbind, the gluster NFS server needs to be restarted. –  Feb 11 '19 at 12:25
  • Inside the pod, `rpcbind.service` already exists and is active (running). If I attempt to restart `glusterd.service`, it doesn't work (I execute `systemctl restart glusterd.service` and it never completes). – Tasos Feb 11 '19 at 12:27
  • Interesting. I think you should open a new GitHub issue: https://github.com/gluster/gluster-kubernetes –  Feb 11 '19 at 12:38
  • Yep, that's what I figured as well: https://github.com/gluster/gluster-kubernetes/issues/562 – Tasos Feb 11 '19 at 12:39