After rebooting the k8s nodes, the OSDs failed to join the cluster with authentication-related errors. I added them to the auth list and that error went away. Now the OSDs join the cluster, but they don't show as up, and the PGs don't show up in ceph -s.

I have spent two weeks on this issue and still don't understand why the OSDs don't show as up. With ms subsystem logging set to 20, the log shows an error on the OSD >> MGR connection, Operation not permitted:

4038023360,v1:10.244.135.63:6801/4038023360] conn(0x55c2deb3a000 0x55c2dd0ee000 crc :-1 s=READY pgs=2984 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)

Checking the OSD status directly from its daemon shows it stuck in the booting state:

[root@rook-ceph-osd-3-79b4cddd7f-52kwm ceph]# ceph daemon osd.3 status
{
    "cluster_fsid": "6078f23a-41af-4f36-aa54-ddc67de63c18",
    "osd_fsid": "b89817d2-3752-4f20-a916-b992990dee8d",
    "whoami": 3,
    "state": "booting",
    "oldest_map": 9094,
    "newest_map": 9677,
    "num_pgs": 33
}
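For checking several OSDs the same way, the status output can be condensed programmatically. The following is only a minimal sketch: `summarize_osd_status` is a hypothetical helper, not part of Ceph or Rook, and it assumes that an OSD which has finished booting reports the state "active". The sample input is the output pasted above.

```python
import json

# Sample copied from the `ceph daemon osd.3 status` output above.
SAMPLE_STATUS = """
{
    "cluster_fsid": "6078f23a-41af-4f36-aa54-ddc67de63c18",
    "osd_fsid": "b89817d2-3752-4f20-a916-b992990dee8d",
    "whoami": 3,
    "state": "booting",
    "oldest_map": 9094,
    "newest_map": 9677,
    "num_pgs": 33
}
"""


def summarize_osd_status(raw):
    """Hypothetical helper: condense the admin-socket status JSON.

    Assumption: an OSD that has completed its boot sequence reports
    the state "active"; any other value means it is not yet serving.
    """
    status = json.loads(raw)
    return {
        "osd": "osd.{}".format(status["whoami"]),
        "state": status["state"],
        "is_active": status["state"] == "active",
        # Number of osdmap epochs between the oldest and newest map held.
        "map_span": status["newest_map"] - status["oldest_map"] + 1,
        "num_pgs": status["num_pgs"],
    }


if __name__ == "__main__":
    print(summarize_osd_status(SAMPLE_STATUS))
```

With the output above, the summary reports osd.3 in state booting, not active, while still holding 33 PGs, which fits the symptom of an OSD that has joined but is never marked up.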

What I tried so far:

  • Check dmesg | scsi - seems fine
  • Check network - exec into OSD and ping mgr and mons, OK.
  • Check iostat -x, util is low
  • Upgraded rook -> 1.9, ceph -> 17

What I'll try (unfortunately... :( ):

  • Zap the OSD disks, clear all Ceph cluster components, and redeploy

Any clues for fixing this issue would be much appreciated.

Ahmad Ahmadi

1 Answer

Don't zap it! That will drop whatever was on the disk.

My guess would be something to do with permissions at the UNIX level, somewhere. Probably within the container, but it could be elsewhere.

If you can re-create the data stored in your ceph cluster, then I suppose zap & restore might work, but the problem may still persist if the container does.

  • How can I re-create the data in the cluster? And, how can I inspect the issue at UNIX level? I don't have any clue to follow. – Ahmad Ahmadi May 10 '23 at 13:37
  • By default Ceph keeps 3 copies of all the data it stores, so your data is probably still present on other OSDs as redundant copies. If that's the case, it should show up in some of the lower-level commands for inspecting the PGs that are having problems. – Charles Bedford Jul 03 '23 at 12:10