
I'm having some problems with mounting a Ceph cluster on Debian machines, and I don't know if I'm doing something wrong, if it's a version problem, or something else.

I'm using the Ceph cluster from OVH and mounting it via fstab on around 20 VMs (2 bare-metal servers with a Proxmox instance on each one).

The problem appears when there is some network failure between the Ceph cluster and our bare metal: from that point on, the Ceph mounts are completely unusable and can only be brought back into use if I restart the server.

Versions being used:

  • Ceph-Cluster: 14.2.16
  • Debian 10 Buster
  • Ceph installed on Debian: 14.2.21 nautilus (stable)

Ceph configuration:

[global]
fsid = xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
mon_host = XX.XX.XXX.XX XX.XX.XXX.XX XX.XX.XXX.XX

fstab configuration:

:/     /mnt/ceph     ceph     name=ceph_user,_netdev,noatime        0     0

Running mount:

xx.xx.xx.xx:6789,xx.xx.xx.xx:6789,xx.xx.xx.xx:6789:/ on /mnt/ceph type ceph (rw,noatime,name=ceph_user,secret=<hidden>,acl)
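
For reference, as far as I understand it, the :/ shorthand in fstab makes mount.ceph take the monitor addresses from mon_host in /etc/ceph/ceph.conf and the key from the keyring for name=ceph_user, so the explicit equivalent should look roughly like the sketch below (the addresses and the secretfile path are placeholders, not my real values):

xx.xx.xx.xx:6789,xx.xx.xx.xx:6789,xx.xx.xx.xx:6789:/   /mnt/ceph   ceph   name=ceph_user,secretfile=/etc/ceph/ceph_user.secret,_netdev,noatime   0   0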

Edit: it just happened again now, so I'm adding some more info:

When this happens, this is what appears when I run ls on /mnt/:

d????????? ? ?    ?       ?            ? ceph
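
When it hangs like this, the kernel client's debugfs entries show what it is blocked on (a sketch, assuming debugfs is mounted at /sys/kernel/debug; the wildcard stands for the <fsid>.<client-id> directory):

ls /sys/kernel/debug/ceph/            # one directory per kernel CephFS client, named <fsid>.<client-id>
cat /sys/kernel/debug/ceph/*/mdsc     # requests still waiting on the MDS (stays non-empty while the mount is hung)
cat /sys/kernel/debug/ceph/*/osdc     # requests still waiting on OSDs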

If I try mount -a:

mount error 16 = Device or resource busy
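
Error 16 apparently just means the old, dead mount is still in place, so mount -a cannot remount over it. A sketch of the recovery attempt I would expect to need short of a full reboot (the lazy-unmount idea is also suggested in the comments below), although so far only a restart has actually worked for me:

umount -f /mnt/ceph || umount -l /mnt/ceph   # force the unmount, falling back to a lazy unmount
mount /mnt/ceph                              # remount from the fstab entry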

Log from /var/log/messages:

Jul 23 21:48:27 prod7-2 kernel: [28344.425057] libceph: mon2 xx.xx.xxx.xx:6789 session lost, hunting for new mon
Jul 23 21:48:27 prod7-2 kernel: [28344.427340] libceph: mon1 xx.xx.xxx.xx:6789 session established
Jul 23 21:48:54 prod7-2 kernel: [28371.560529] ceph: mds0 caps stale
Jul 23 21:52:53 prod7-2 kernel: [28610.660328] ceph: mds0 hung
Jul 23 21:53:25 prod7-2 kernel: [28642.659775] libceph: mon1 xx.xx.xxx.xx:6789 session lost, hunting for new mon
Jul 23 21:53:25 prod7-2 kernel: [28642.677667] libceph: mon0 xx.xx.xxx.xx:6789 session established
Jul 23 21:53:39 prod7-2 kernel: [28656.231175] libceph: mds0 xx.xx.xxx.xx:6801 socket closed (con state OPEN)
Jul 23 21:53:40 prod7-2 kernel: [28657.459175] libceph: reset on mds0
Jul 23 21:53:40 prod7-2 kernel: [28657.459179] ceph: mds0 closed our session
Jul 23 21:53:40 prod7-2 kernel: [28657.459180] ceph: mds0 reconnect start
Jul 23 21:53:40 prod7-2 kernel: [28657.498027] ceph: mds0 reconnect denied
Jul 23 21:53:40 prod7-2 kernel: [28657.513419] libceph: mds0 xx.xx.xxx.xx:6801 socket closed (con state NEGOTIATING)
Jul 23 21:53:41 prod7-2 kernel: [28658.454421] ceph: mds0 rejected session
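
The "mds0 reconnect denied" / "rejected session" lines look like the MDS has evicted (and possibly blacklisted) this client, which would explain why it never recovers on its own. A sketch of what I would check on the cluster side if I had admin access (standard ceph CLI commands; Nautilus still uses the "blacklist" name):

ceph health detail            # look for warnings about evicted or failing-to-respond clients
ceph osd blacklist ls         # an evicted client that gets "reconnect denied" is usually listed here
ceph tell mds.0 client ls     # sessions the MDS currently knows about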

Am I doing something wrong? Thanks

Tio
  • Is the colon symbol `:` in `fstab` a mistake, or is this the real record? – Romeo Ninov Jul 23 '23 at 17:04
  • It's the real record, it's exactly like that. – Tio Jul 23 '23 at 19:59
  • Unfortunately, there are very few options to bring a stale mount back; usually a reboot really is the best option. You could try `umount -l`, though. But since Ceph is a network storage system, this won't be the only issue you can expect if you have regular network outages; they can easily lead to corrupt PGs and data loss. I would recommend finding out the root cause of the network issues. – eblock Jul 25 '23 at 12:31
  • @eblock The problem seems to be something in OVH's infrastructure. I guess they do some maintenance or something like that, and the connection simply goes down for a while, which prevents the reconnection. – Tio Jul 25 '23 at 15:44

0 Answers