I am unsure if this is the platform to ask. But hopefully it is :).
I've got a 3 node setup of ceph.
node1
mds.node1 , mgr.node1 , mon.node1 , osd.0 , osd.1 , osd.6
14.2.22
node2
mds.node2 , mon.node2 , osd.2 , osd.3 , osd.7
14.2.22
node3
mds.node3 , mon.node3 , osd.4 , osd.5 , osd.8
14.2.22
For some reason though, When I down one node, It does not start backfilling/recovery at all. It just reports 3 osd's down as below. But does nothing to repair it....
If I run a ceph -s
I get the below ouput:
[root@node1 testdir]# ceph -s
cluster:
id: 8932b76b-282b-4385-bee8-5c295af88e74
health: HEALTH_WARN
3 osds down
1 host (3 osds) down
Degraded data redundancy: 30089/90267 objects degraded (33.333%), 200 pgs degraded, 512 pgs undersized
1/3 mons down, quorum node1,node2
services:
mon: 3 daemons, quorum node1,node2 (age 2m), out of quorum: node3
mgr: node1(active, since 48m)
mds: homeFS:1 {0=node1=up:active} 1 up:standby-replay
osd: 9 osds: 6 up (since 2m), 9 in (since 91m)
data:
pools: 4 pools, 512 pgs
objects: 30.09k objects, 144 MiB
usage: 14 GiB used, 346 GiB / 360 GiB avail
pgs: 30089/90267 objects degraded (33.333%)
312 active+undersized
200 active+undersized+degraded
io:
client: 852 B/s rd, 2 op/s rd, 0 op/s wr
[root@node1 testdir]#
The odd thing though, when I boot up my 3rd node again it does recover and sync. But it looks like it's backfilling just not starting at all... Is there something that might be causing it?
Update What I did notice, If I mark a drive as out, it does recover it... But when the server node's down, and the drive's marked as out, it then does not recover it at all...
Update 2: I noticed while experimenting that if the OSD is up, but out, It does recover... When the OSD is marked as down it does not begin to recover at all...