
In a 20 node cluster with 10 OSDs per node, how would you remove nodes 1-5? If I reweight the OSDs on node 1, then data will move to nodes 2-20. Then I would do the same for node 2 through node 5.

Is there a way to put the OSDs in node2-node5 in a read-only state so data doesn't have to move twice?

1 Answer


So you want to evict all data from nodes 1 to 5. This is possible if the remaining nodes can hold all the data while retaining all copies for redundancy.
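
To check whether that is the case, the built-in usage reports give a quick overview of free capacity per pool and per OSD before you start. A minimal sketch of the commands (no output shown):

# overall and per-pool usage, then usage per OSD grouped by host
ceph df
ceph osd df tree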

If this is the case, you can reweight all OSDs on nodes 1 to 5 to zero weight at the same time. With every reweight command given, Ceph recalculates the needed movement, so the whole data movement is done in one step. Then just sit and wait until the rebalance is finished.
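
A minimal sketch of how that could look, assuming the hosts show up as node1 through node5 in the CRUSH map (ceph osd ls-tree lists the OSD IDs below a CRUSH bucket):

# hypothetical host names; adjust to your CRUSH tree
for host in node1 node2 node3 node4 node5; do
    for id in $(ceph osd ls-tree "$host"); do
        ceph osd reweight "osd.$id" 0   # one reweight per OSD; the movement is merged into one plan
    done
done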

The speed of the rebalance is the same as when rebalancing only one node, because objects are moved with the same batch size regardless of how many reweight parameters are changed. Every change results in a new CRUSH map, and the cluster converges towards that new map. There are Ceph settings to control how much data is moved in parallel.
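
For example, the backfill and recovery throttles can be changed at runtime. The values below are only illustrative, not a recommendation:

ceph config set osd osd_max_backfills 1          # concurrent backfills per OSD
ceph config set osd osd_recovery_max_active 1    # concurrent recovery ops per OSD
ceph config set osd osd_recovery_sleep 0.1       # seconds to sleep between recovery ops (slows the rebalance down)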

Alternatively, there is also a script that will drain the OSDs at an even slower rate.

After the rebalance you have to remove the OSDs and any remaining Ceph services from the evicted nodes. Then you can decommission the old nodes, and you are done.
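
A rough sketch of that step, where osd.<ID> and <hostname> are placeholders and the commands are only run once the cluster reports HEALTH_OK again:

ceph osd out osd.<ID>                        # mark the drained OSD out
systemctl stop ceph-osd@<ID>                 # on the node that hosts the OSD
ceph osd purge <ID> --yes-i-really-mean-it   # removes it from the CRUSH map, auth keys and OSD map
ceph osd crush remove <hostname>             # remove the now-empty host bucket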

As a side note: you write that by reweighting all OSDs on node 1, data will be moved to node 2. This is not completely correct. Correct is: data will move to all remaining nodes.

Here is an example of how I drain OSDs:

First check the OSD tree:

root@odroid1:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         51.93213  root default
 -7         12.73340      host miniceph1
  3    hdd  12.73340          osd.3           up   1.00000  1.00000
 -5         12.73340      host miniceph2
  1    hdd  12.73340          osd.1           up   1.00000  1.00000
 -3         12.73340      host miniceph3
  0    hdd  12.73340          osd.0           up   1.00000  1.00000
-13          2.72899      host miniceph4
  4    hdd   2.72899          osd.4           up   1.00000  1.00000
-11          2.72899      host miniceph5
  5    hdd   2.72899          osd.5           up   1.00000  1.00000
 -9          2.72899      host miniceph6
  2    hdd   2.72899          osd.2           up   1.00000  1.00000
-25          1.86299      host odroid1
  7    ssd   1.86299          osd.7           up   1.00000  1.00000
-22          1.81898      host odroid2
  6    ssd   1.81898          osd.6           up   1.00000  1.00000
-28          1.86299      host odroid3
  8    ssd   1.86299          osd.8           up   1.00000  1.00000

In this example the miniceph OSDs contain HDDs and the odroid OSDs contain SSDs.

I'd like to remove miniceph 2, 4 and 5. These contain smaller disks I want to replace. But before removing the OSDs I'd like to move the data to the remaining miniceph OSDs. That way all data stays fully replicated. Just removing an OSD would result in a failure scenario.
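
As an aside, before actually removing or stopping an OSD you can ask Ceph whether doing so would put data at risk. A minimal sketch, using the OSD IDs from the example above:

ceph osd ok-to-stop osd.4          # would stopping this OSD leave PGs unavailable?
ceph osd safe-to-destroy osd.4     # is it safe to destroy, i.e. no data that exists only on this OSD?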

To drain the data off these OSDs, I issue:

root@odroid1:~# ceph osd reweight osd.4 0
reweighted osd.4 to 0 (0)
root@odroid1:~# ceph osd reweight osd.5 0
reweighted osd.5 to 0 (0)
root@odroid1:~# ceph osd reweight osd.2 0
reweighted osd.2 to 0 (0)

And now the data is moved to the remaining OSDs.

root@odroid1:~# ceph -s
  cluster:
    id:     51c02ed5-2025-4ae7-91d6-fb5450c4b4d7
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            Degraded data redundancy: 4/4707252 objects degraded (0.000%), 1 pg degraded, 2 pgs undersized

  services:
    mon: 3 daemons, quorum odroid2,odroid3,odroid1 (age 24h)
    mgr: odroid1(active, since 24h), standbys: odroid2
    mds: cephfs:1 {0=odroid1=up:active} 1 up:standby
    osd: 9 osds: 9 up (since 24h), 6 in (since 4m); 168 remapped pgs
    rgw: 2 daemons active (odroid1.rgw0, odroid2.rgw0)

  task status:

  data:
    pools:   12 pools, 401 pgs
    objects: 1.57M objects, 1.8 TiB
    usage:   4.2 TiB used, 40 TiB / 44 TiB avail
    pgs:     4/4707252 objects degraded (0.000%)
             1156448/4707252 objects misplaced (24.567%)
             232 active+clean
             165 active+remapped+backfill_wait
             1   active+recovering
             1   active+remapped+backfilling
             1   active+recovery_wait+undersized+degraded+remapped
             1   active+recovering+undersized+remapped

  io:
    recovery: 9.8 MiB/s, 2 objects/s

  progress:
    Rebalancing after osd.4 marked out (7m)
      [===========.................] (remaining: 9m)
    Rebalancing after osd.5 marked out (6m)
      [=========...................] (remaining: 13m)
    Rebalancing after osd.2 marked out (4m)
      [===========.................] (remaining: 6m)
itsafire
  • I understand that reweighting all 5 nodes at the same time will unload them. I didn't want to flood the system with all 5 at once. I wanted to remove 1 node at a time. – Mark Frenette Jun 24 '20 at 13:42
  • It doesn't matter. Even if you re-weight all nodes, the movement is done PG by PG. How many PGs are moved in parallel is controlled by settings; my cluster always moves up to 2 PGs at the same time. So the strain on your system is the same whether you re-weight one or many nodes. Of course you can remove 1 node at a time: re-weight the OSDs of that node to zero. The strain on the system will be the same, it will only finish faster because there is less data to move around. But understand: it doesn't matter how many nodes you re-weight, the bandwidth/CPU used will be the same. – itsafire Jun 30 '20 at 09:05
  • There is a script on GitHub that will drain your OSDs at an even slower rate. I have not tested it yet, but the project seems to be under active development. See my answer for the link. – itsafire Jun 30 '20 at 10:12