Why Ceph turns status to Err when there is still available storage space

Question

I built a 3 node Ceph cluster recently. Each node had seven 1TB HDD for OSDs. In total, I have 21 TB of storage space for Ceph.

However, when I ran a workload to keep writing data to Ceph, it turns to Err status and no data can be written to it any more.

The output of ceph -s is:

 cluster:
    id:     06ed9d57-c68e-4899-91a6-d72125614a94
    health: HEALTH_ERR
            1 full osd(s)
            4 nearfull osd(s)
            7 pool(s) full

  services:
    mon: 1 daemons, quorum host3
    mgr: admin(active), standbys: 06ed9d57-c68e-4899-91a6-d72125614a94
    osd: 21 osds: 21 up, 21 in
    rgw: 4 daemons active

  data:
    pools:   7 pools, 1748 pgs
    objects: 2.03M objects, 7.34TiB
    usage:   14.7TiB used, 4.37TiB / 19.1TiB avail
    pgs:     1748 active+clean

Based on my comprehension, since there is still 4.37 TB space left, Ceph itself should take care about how to balance the workload and make each OSD to not be at full or nearfull status. But the result doesn't work as my expectation, 1 full osd and 4 nearfull osd shows up, the health is HEALTH_ERR.

I can't visit Ceph with hdfs or s3cmd anymore, so here comes the question:
1, Any explanation about current issue?
2, How can I recover from it? Delete data on Ceph node directly with ceph-admin, and relaunch the Ceph?

score 3 · Answer 1 · answered Sep 30 '19 at 08:51

Not get an answer for 3 days and I made some progress, let me share my findings here.

1, It's normal for different OSD to have size gap. If you list OSD with ceph osd df, you will find that different OSD has different usage ratio.

2, To recover from this issue, the issue here means the cluster crush due to OSD full. Follow steps below, it's mostly from redhat.

Get ceph cluster health info by ceph health detail. It's not necessary but you can get the ID of failed OSD.
Use ceph osd dump | grep full_ratio to get current full_ratio. Do not use statement listed at above link, it's obsoleted. The output can be like

full_ratio 0.95 backfillfull_ratio 0.9 nearfull_ratio 0.85

Set OSD full ratio a little higher by ceph osd set-full-ratio <ratio>. Generally, we set ratio to 0.97
Now, the cluster status will change from HEALTH_ERR to HEALTH_WARN or HEALTH_OK. Remove some data that can be released.
Change OSD full ratio back to previous ratio. It can't be 0.97 always cause it's a little risky.

Hope this thread is helpful to some one who ran into same issue. The details about OSD configuration please refer to ceph.

score 2 · Accepted Answer · answered Nov 08 '19 at 17:08

Ceph requires free disk space to move storage chunks, called pgs, between different disks. As this free space is so critical to the underlying functionality, Ceph will go into HEALTH_WARN once any OSD reaches the near_full ratio (generally 85% full), and will stop write operations on the cluster by entering HEALTH_ERR state once an OSD reaches the full_ratio.

However, unless your cluster is perfectly balanced across all OSDs there is likely much more capacity available, as OSDs are typically unevenly utilized. To check overall utilization and available capacity you can run ceph osd df.

Example output:

ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL    %USE  VAR  PGS STATUS 
 2   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB  72 MiB 3.6 GiB  742 GiB 73.44 1.06 406     up 
 5   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB 119 MiB 3.3 GiB  726 GiB 74.00 1.06 414     up 
12   hdd 2.72849  1.00000 2.7 TiB 2.2 TiB 2.2 TiB  72 MiB 3.7 GiB  579 GiB 79.26 1.14 407     up 
14   hdd 2.72849  1.00000 2.7 TiB 2.3 TiB 2.3 TiB  80 MiB 3.6 GiB  477 GiB 82.92 1.19 367     up 
 8   ssd 0.10840        0     0 B     0 B     0 B     0 B     0 B      0 B     0    0   0     up 
 1   hdd 2.72849  1.00000 2.7 TiB 1.7 TiB 1.7 TiB  27 MiB 2.9 GiB 1006 GiB 64.01 0.92 253     up 
 4   hdd 2.72849  1.00000 2.7 TiB 1.7 TiB 1.7 TiB  79 MiB 2.9 GiB 1018 GiB 63.55 0.91 259     up 
10   hdd 2.72849  1.00000 2.7 TiB 1.9 TiB 1.9 TiB  70 MiB 3.0 GiB  887 GiB 68.24 0.98 256     up 
13   hdd 2.72849  1.00000 2.7 TiB 1.8 TiB 1.8 TiB  80 MiB 3.0 GiB  971 GiB 65.24 0.94 277     up 
15   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB  58 MiB 3.1 GiB  793 GiB 71.63 1.03 283     up 
17   hdd 2.72849  1.00000 2.7 TiB 1.6 TiB 1.6 TiB 113 MiB 2.8 GiB  1.1 TiB 59.78 0.86 259     up 
19   hdd 2.72849  1.00000 2.7 TiB 1.6 TiB 1.6 TiB 100 MiB 2.7 GiB  1.2 TiB 56.98 0.82 265     up 
 7   ssd 0.10840        0     0 B     0 B     0 B     0 B     0 B      0 B     0    0   0     up 
 0   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB 105 MiB 3.0 GiB  734 GiB 73.72 1.06 337     up 
 3   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB  98 MiB 3.0 GiB  781 GiB 72.04 1.04 354     up 
 9   hdd 2.72849        0     0 B     0 B     0 B     0 B     0 B      0 B     0    0   0     up 
11   hdd 2.72849  1.00000 2.7 TiB 1.9 TiB 1.9 TiB  76 MiB 3.0 GiB  817 GiB 70.74 1.02 342     up 
16   hdd 2.72849  1.00000 2.7 TiB 1.8 TiB 1.8 TiB  98 MiB 2.7 GiB  984 GiB 64.80 0.93 317     up 
18   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB  79 MiB 3.0 GiB  792 GiB 71.65 1.03 324     up 
 6   ssd 0.10840        0     0 B     0 B     0 B     0 B     0 B      0 B     0    0   0     up 
                    TOTAL  47 TiB  30 TiB  30 TiB 1.3 GiB  53 GiB   16 TiB 69.50                 
MIN/MAX VAR: 0.82/1.19  STDDEV: 6.64

As you can see in the above output, the used OSDs vary from 56.98% (OSD 19) to 82.92% (OSD 14) utilized, which is a significant variance.

As only a single OSD is full, and only 4 of your 21 OSD's are nearfull you likely have a significant amount of storage still available in your cluster, which means that it is time to perform a rebalance operation. This can be done manually by reweighting OSDs, or you can have Ceph do a best-effort rebalance by running the command ceph osd reweight-by-utilization. Once the rebalance is complete (i.e you have no objects misplaced in ceph status) you can check for the variation again (using ceph osd df) and trigger another rebalance if required.

If you are on Luminous or newer you can enable the Balancer plugin to handle OSD rewighting automatically.

Most welcome. If this answers your question can you please accept it as the answer? — chaosaffe, Nov 11 '19 at 12:22

Why Ceph turns status to Err when there is still available storage space

2 Answers2