
I have three servers, each with 1 x SSD drive (for the Ceph base OS) and 6 x 300GB SAS drives; at the moment I'm only using 4 drives on each server as OSDs in my Ceph storage array, and everything is fine. My question is this: now that I have built this and got everything up and running, if in 6 months or so I need to replace these OSDs because the storage array is running out of space, is it possible to remove one disk at a time from each server and replace it with a larger drive?

For example, if server 1 has OSD 0-5, server 2 has OSD 6-11 and server 3 has OSD 12-17, could I one day remove OSD 0 and replace it with a 600GB SAS drive, wait for the cluster to heal, then do the same with OSD 6, then OSD 12, and so on until all the disks are replaced? And would this then give me a larger storage pool?

Scott McKeown
  • Hi. I'm not a Ceph professional either, but I saw that your question remained unanswered, so here's my take: I think that if your placement rules enforce that not all replicas of an object are stored on a single disk, you're safe to do this. To do so, the size of your pools must also be greater than one. Again, I'm not very experienced with Ceph, but theoretically this shouldn't cause a problem with your upgrade. – Ali Tou Aug 10 '20 at 06:51
  • Hi Ali, thanks for the input. I'm going to give it a go as I've got a few old servers around doing nothing. I just need to order some more drives to test the upgrade in drive size. – Scott McKeown Aug 10 '20 at 11:37
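Following up on the replica-placement point in the comment above, those preconditions can be checked from the CLI before pulling any disk. This is only a sketch, and the pool name used below ('rbd') is just an example; your pool names and CRUSH rules will differ.

```bash
# List pools with their replica counts (replicated size should be > 1)
ceph osd pool ls detail

# Or query a single pool directly (the pool name 'rbd' is only an example)
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Dump the CRUSH rules to confirm replicas are spread across hosts, not just OSDs
ceph osd crush rule dump
```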

2 Answers


OK, just for anyone looking for this answer in the future: you can upgrade your drives in the way I mentioned above. Here are the steps that I took (please note that this was in a lab, not production); a rough command sketch follows the list.

  1. Mark the OSD as down
  2. Mark the OSD as out
  3. Remove the drive in question
  4. Install new drive (must be either the same size or larger)
  5. I needed to reboot the server in question for the new disk to be seen by the OS
  6. Add the new disk into Ceph as normal
  7. Wait for the cluster to heal then repeat on a different server
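For anyone who wants the CLI side of the steps above, a sketch on a reasonably recent release (Luminous or later, where ceph osd purge exists) might look like the following. The OSD id and device path are placeholders; adapt them to your own layout and treat this as an outline rather than a tested script.

```bash
OSD=0                              # placeholder OSD id
DEV=/dev/sdf                       # placeholder device path of the replacement disk

# Steps 1-2: stop the OSD daemon (marks it down) and mark it out so Ceph stops using it
systemctl stop ceph-osd@$OSD       # on package-based (non-containerized) installs
ceph osd out $OSD

# Once you are ready to destroy it, remove the OSD from the cluster entirely
# (purge combines 'crush remove', 'auth del' and 'osd rm' on Luminous and later)
ceph osd purge $OSD --yes-i-really-mean-it

# Steps 3-5: physically swap the disk, rebooting if the OS needs it to see the new drive

# Step 6: create a new OSD on the replacement disk
ceph-volume lvm create --data $DEV

# Step 7: watch recovery and wait for HEALTH_OK before moving on to the next server
ceph -w
```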

I have now done this with 6 out of my 15 drives across 3 servers, and each time the size of the Ceph storage has increased a little (I'm only going from 320GB drives to 400GB drives, as this is just a test and I had some of these drives not in use).

I plan on starting this on the live production servers next week now that I know it works, and going from 300GB to 600GB drives I should see a larger increase in storage (I hope).
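If it helps anyone, the capacity change after each swap can be confirmed from the cluster itself rather than by eyeballing it; a couple of read-only commands are enough.

```bash
# Total raw capacity and usage across the cluster
ceph df

# Per-OSD sizes and utilization, so the newly created larger OSD is easy to spot
ceph osd df tree
```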

Scott McKeown
  • Isn't it better to mark the disk as out, wait for the misplaced PGs to be moved, then mark the disk as down and replace it? This way you never have degraded PGs (but the entire process might take longer, and I guess when you have replication factor 3 you trust the other 2 copies whilst the 3rd one is rebuilt?). – Jens Timmerman Dec 06 '22 at 11:56
  • I honestly don't know. I'm in the process of building another Ceph system at the moment, so I could test this, but I don't really know how to benchmark the outcome. – Scott McKeown Dec 28 '22 at 12:21
  • I'm now doing this a middle way: I marked the OSD as out, which marks its PGs as misplaced, waited a while for a bunch of them to be moved to the correct place, then marked it as down, which makes the remaining PGs degraded and starts the recovery. However, I still have the disk in place, so if I mark it back up, the PGs that haven't been written to since are again only misplaced, not degraded (most of my data is write-once). So if another OSD were to fail and those PGs were really gone, I could bring the one I marked as down back up and there would be a way to reconstruct the PGs again. This way I don't have to run degraded for too long. – Jens Timmerman Dec 29 '22 at 13:48
  • Just to follow up on this: I would agree that marking a disk out first and allowing the cluster to 'repair' before marking it down and destroying the drive looks to be the better option. Yes, it did take a little longer, but Ceph didn't complain as much. – Scott McKeown Jan 03 '23 at 22:45
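Based on the discussion in these comments, the "out first" variant can be sketched roughly like this: mark the OSD out, wait until the data has been re-placed, and only then stop and remove it. The polling loop is just one way to wait for a clean state, and the OSD id is a placeholder.

```bash
OSD=0                                  # placeholder OSD id

# Mark the OSD out but leave the daemon running, so PGs are only misplaced, not degraded
ceph osd out $OSD

# Give the cluster a moment to start remapping, then wait until it is clean again
sleep 60
until ceph health | grep -q HEALTH_OK; do
    sleep 60
done

# Only now stop the daemon and remove the OSD, then swap the physical disk
systemctl stop ceph-osd@$OSD           # non-containerized deployments
ceph osd purge $OSD --yes-i-really-mean-it
```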

If you have a lot of disks in your servers and you want to upgrade all of them, I believe it is also possible to drain one host at a time (a CLI sketch follows the steps):

  1. Select a host and drain it (GUI-->Cluster-->Hosts-->Select Host-->Start drain)
  2. Wait for the drain to finish
  3. Shut down the host (or not, if the disks are hot-pluggable)
  4. Replace all the disks in the host with the bigger ones.
  5. Remove the _no_schedule label from the host and let Ceph recreate the services
  6. Let Ceph recreate the OSDs (or create them yourself if necessary)
  7. Wait for the cluster to be in a healthy state again.
  8. Repeat with the other hosts.
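For clusters managed by cephadm, the same drain can be done from the CLI instead of the dashboard. The commands below are roughly equivalent to the GUI steps above; the hostname and device path are placeholders.

```bash
HOST=ceph-node-1                          # placeholder hostname

# Step 1: drain the host (applies the _no_schedule label and removes its daemons/OSDs)
ceph orch host drain $HOST

# Step 2: watch the OSD removal queue until it is empty
ceph orch osd rm status

# Steps 3-4: power down if needed and swap the disks

# Step 5: allow the orchestrator to schedule daemons on the host again
ceph orch host label rm $HOST _no_schedule

# Step 6: let the OSD service spec pick up the new disks, or add one explicitly
ceph orch daemon add osd $HOST:/dev/sdb   # example device path
```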
ImmuC
  • You have a GUI! In all seriousness, I guess if you don't have hot-swap drives then yes, this would have to be the way to upgrade your cluster. – Scott McKeown Jun 08 '23 at 13:05