If one OSD crashes, does rook-ceph eventually try to replicate the missing data onto the still-working OSDs, or does it wait for all OSDs to be healthy again? Let's assume yes, so I can explain how I calculated:
I started with 1.71 TB provisioned for Kubernetes PVCs and 3 nodes of 745 GB each (2.23 TB total). Rook has a replication factor of 2 (RF=2).
For replication to work, I need 2 × 1.71 TB (3.42 TB), so I added 2 more nodes of 745 GB each (3.72 TB total). Let's say I use all of the 1.71 TB provisioned.
If I lose an OSD, my K8s cluster still runs because the data is replicated, but when the missing data gets re-replicated onto the still-working OSDs, the other OSDs may crash. Assuming data is always equally distributed across OSDs (which I know is not true in the long run):
- I have ~305 GB of unused space on the cluster (3725 GB total − 3420 GB provisioned for PVCs)
- Which is ~61 GB per OSD (305 / 5)
- The crashed OSD held ~684 GB (745 GB disk − 61 GB unused)
- Ceph tries to re-replicate ~171 GB of missing data onto each remaining OSD (684 / 4)
- Which is way too much, because each OSD only has ~61 GB free, so this should lead to cascading OSD failures (see the sketch below)
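Here is a quick Python sketch of that arithmetic, assuming data stays perfectly balanced across OSDs (which CRUSH only approximates); the function name and structure are just for illustration, not how Ceph actually computes anything:

```python
def rebalance_after_one_osd_failure(num_osds, osd_size_gb, provisioned_tb, rf=2):
    """Rough check: can the surviving OSDs absorb a failed OSD's data?

    Assumes data is perfectly balanced across OSDs, which CRUSH only
    approximates in practice.
    """
    raw_used_gb = provisioned_tb * 1000 * rf          # raw GB stored (data x replicas)
    total_raw_gb = num_osds * osd_size_gb             # raw GB of capacity
    free_per_osd = (total_raw_gb - raw_used_gb) / num_osds
    data_on_failed_osd = osd_size_gb - free_per_osd   # GB that must be re-replicated
    load_per_survivor = data_on_failed_osd / (num_osds - 1)
    return free_per_osd, load_per_survivor

# 5 OSDs of 745 GB, 1.71 TB provisioned at RF=2
free, load = rebalance_after_one_osd_failure(5, 745, 1.71)
print(f"free per OSD: {free:.0f} GB, to absorb per surviving OSD: {load:.0f} GB")
# free per OSD: 61 GB, to absorb per surviving OSD: 171 GB  -> does not fit, cascade
```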
If I had 6 nodes instead of 5, though, I could lose 1 OSD indefinitely:
- The new raw capacity is ~4.47 TB (6 × 745 GB)
- I have ~1050 GB of free space on the cluster (4470 GB total − 3420 GB provisioned for PVCs)
- Which is ~175 GB per OSD (1050 / 6)
- The crashed OSD holds at most ~570 GB of data (745 − 175)
- Ceph tries to re-replicate ~114 GB of missing data onto each of the 5 remaining OSDs (570 / 5)
- Which is less than the free space on each OSD (~175 GB), so replication still works with only 5 OSDs left, but if another OSD crashes I'm doomed (the second sketch below re-runs these numbers)
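And the same sketch with 6 OSDs, reusing the hypothetical helper above:

```python
# 6 OSDs of 745 GB, same 1.71 TB provisioned at RF=2
free, load = rebalance_after_one_osd_failure(6, 745, 1.71)
print(f"free per OSD: {free:.0f} GB, to absorb per surviving OSD: {load:.0f} GB")
# free per OSD: 175 GB, to absorb per surviving OSD: 114 GB  -> fits, but no margin for a second failure
```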
Is the initial assumption correct? If so, does the math sound right to you?