
If 1 OSD crashes, does rook-ceph eventually try to replicate the missing data to the still-working OSDs, or does it wait for all OSDs to be healthy again? Let's say yes, so that I can explain how I calculated:

I started with 1.71 TB provisioned for Kubernetes PVCs and 3 nodes of 745 GB each (2.23 TB total). Rook has a replication factor of 2 (RF=2).

For the replication to work, I need 2 times 1.71 TB (3.42 TB), so I added 2 nodes of 745 GB each (3.72 TB total). Let's say I use all of the 1.71 TB provisioned.

If I lose an OSD, my K8S cluster still runs because the data is replicated, but once the missing data is re-replicated onto the still-working OSDs, other OSDs may crash because, assuming data is always equally distributed across OSDs (which I know is not true in the long run):

  • I have 290 GB of unused space on my cluster (3.72 TB total - 3.42 TB of PVC provisioning)
  • Which is 58 GB per OSD (290 / 5)
  • The crashed OSD holds 687 GB (745 GB disk - 58 GB unused)
  • Ceph tries to re-replicate 172 GB of missing data onto each remaining OSD (687 / 4)
  • Which is way too much, because each OSD only has 58 GB free, and that should lead to cascading OSD failures (a rough sketch of this arithmetic follows the list)
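
Here is a rough sketch of that arithmetic (my own toy model, assuming perfectly even data distribution and RF=2, so the figures come out slightly different from my rounded ones above):

    # Toy model of the 5-OSD scenario (assumes perfectly even distribution, size=2)
    osd_size_gb = 745
    n_osds = 5
    replicated_gb = 1710 * 2                     # 3420 GB of raw usage for 1.71 TB of PVCs

    used_per_osd = replicated_gb / n_osds        # ~684 GB on each OSD
    free_per_osd = osd_size_gb - used_per_osd    # ~61 GB free on each OSD

    # One OSD dies: its data has to be re-replicated onto the 4 survivors.
    recovery_per_survivor = used_per_osd / (n_osds - 1)   # ~171 GB each

    print(recovery_per_survivor <= free_per_osd)  # False: recovery does not fit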

If I had 6 nodes instead of 5, I could lose 1 OSD indefinitely though (same arithmetic, sketched after the list):

  • The new pool is 4.5 TB (6 x 745 GB)
  • I have 1+ TB of free space on the cluster (4.5 TB total - 3.42 TB of PVC provisioning)
  • Which is 166+ GB per OSD (~1 TB / 6)
  • The crashed OSD holds at most 579 GB of data (745 - 166)
  • Ceph tries to re-replicate about 116 GB of missing data onto each remaining OSD (579 / 5)
  • Which is less than the free space on each OSD (166+ GB), so replication works again with only 5 nodes left, but if another OSD crashes I'm doomed.
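
The same toy model for the 6-OSD case (again assuming perfectly even distribution):

    # Toy model of the 6-OSD scenario (even distribution, size=2)
    osd_size_gb = 745
    n_osds = 6
    replicated_gb = 1710 * 2                     # 3420 GB of raw usage

    used_per_osd = replicated_gb / n_osds        # ~570 GB on each OSD
    free_per_osd = osd_size_gb - used_per_osd    # ~175 GB free on each OSD

    recovery_per_survivor = used_per_osd / (n_osds - 1)   # ~114 GB per surviving OSD

    print(recovery_per_survivor <= free_per_osd)  # True: one OSD loss is absorbable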

Is the initial assumption correct? If so, does the maths sound right to you?

GuiFP

1 Answer


First: if you value your data, don't use replication with size 2! You will eventually have issues leading to data loss.

Regarding your calculation: Ceph doesn't distribute every MB of data evenly across all nodes; there will be differences between your OSDs. Because of that, the OSD with the most data will be your bottleneck for free space and for the capacity to rebalance after a failure. Ceph also doesn't handle full or nearly full clusters very well, and your calculation is very close to a full cluster, which will lead to new issues. Try to avoid a cluster with more than 85 or 90 % used capacity; plan ahead and use more disks to both avoid a full cluster and get higher failure resilience. The more OSDs you have, the less impact a single disk failure will have on the rest of the cluster.
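
As a rough back-of-the-envelope check (my own sketch, not anything Ceph provides), you can ask whether the surviving OSDs would stay below a target fill ratio after losing one OSD and rebalancing:

    # Rough check: does the already-replicated raw data still fit on the remaining
    # OSDs after one failure, without exceeding a target fill ratio?
    def survives_one_osd_loss(n_osds, osd_gb, stored_raw_gb, max_ratio=0.85):
        remaining_raw = (n_osds - 1) * osd_gb
        return stored_raw_gb <= remaining_raw * max_ratio

    print(survives_one_osd_loss(5, 745, 3420))   # False: 3420 > 4 * 745 * 0.85
    print(survives_one_osd_loss(6, 745, 3420))   # False: 3420 > 5 * 745 * 0.85
    print(survives_one_osd_loss(7, 745, 3420))   # True:  3420 < 6 * 745 * 0.85

With your numbers, even 6 OSDs would sit above 85 % used after a single failure, which is exactly the "close to a full cluster" situation I mean.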

And regarding recovery: Ceph usually tries to recover automatically, but it depends on your actual CRUSH map and the rulesets your pools are configured with. For example, say you have a CRUSH tree consisting of 3 racks and your pool is configured with size 3 (so 3 replicas in total) spread across those 3 racks (failure-domain = rack), and then a whole rack fails. In that case Ceph won't be able to recover the third replica until the rack is online again. The data is still available to clients, but your cluster is in a degraded state. This configuration has to be done manually, so it probably won't apply to you; I just wanted to point out how that works. The default usually is a pool with size 3 and host as the failure-domain.
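
To make the rack example concrete, here is a toy illustration (not how CRUSH actually computes placement): the number of replicas Ceph can place is bounded by the number of distinct failure domains that are still up.

    # Toy illustration of the failure-domain idea (not actual CRUSH logic)
    def placeable_replicas(available_failure_domains, pool_size):
        # A replicated pool puts at most one copy per failure domain,
        # so placement is capped by how many domains are still up.
        return min(pool_size, len(available_failure_domains))

    racks = {"rack1", "rack2", "rack3"}
    print(placeable_replicas(racks, 3))              # 3 -> healthy
    print(placeable_replicas(racks - {"rack3"}, 3))  # 2 -> degraded until rack3 is back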

eblock
  • Thanks for your answer. I'm not sure about Ceph's behavior though: does it try to replicate the crashed OSD's data again onto the working ones, or does replication only apply to a healthy cluster? – GuiFP Jan 29 '21 at 13:27
  • No, that's the concept of Ceph: to recover from broken disks and to always have enough healthy replicas (or chunks in the case of EC pools). – eblock Jan 29 '21 at 15:14
  • So basically, if you know that you will not be able to repair a crashed OSD quickly, you'd better have as much free space as possible to be able to survive the crash of multiple OSDs, or even better: set up an EC pool, right? – GuiFP Jan 29 '21 at 15:32
  • An EC pool doesn't make it better (especially performance-wise); you can just save some space using EC, but it really depends on your actual setup and resiliency requirements. It requires proper planning with correct CRUSH rules and reasonable failure-domains. And yes, never let your cluster get full or even nearfull, as that would prevent a recovery in case of a disk or host failure. – eblock Jan 29 '21 at 15:39
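
For reference on the space aspect of that last point: storing 1.71 TB of logical data needs roughly the following raw capacity under different redundancy schemes (the EC profile of k=4 data plus m=2 coding chunks is just a hypothetical example; the right profile depends on the cluster):

    # Raw capacity needed for 1.71 TB of logical data (illustrative profiles only)
    logical_tb = 1.71
    print(logical_tb * 2)            # size=2 replication: 3.42 TB raw
    print(logical_tb * 3)            # size=3 replication: 5.13 TB raw
    print(logical_tb * (4 + 2) / 4)  # EC k=4, m=2:        ~2.57 TB raw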