
I'm working on setting up a Ceph cluster with Docker and the image 'ceph/daemon:v3.1.0-stable-3.1-luminous-centos-7', but after the cluster has been set up, the ceph status command never reaches HEALTH_OK. Here is my cluster's information. It has enough disk space and the network is fine.

My questions are:

  1. Why does Ceph not replicate the 'undersized' placement groups (PGs)?
  2. How to fix it?

Thank you very much!

➜  ~ ceph -s
  cluster:
    id:     483a61c4-d3c7-424d-b96b-311d2c6eb69b
    health: HEALTH_WARN
            Degraded data redundancy: 3 pgs undersized

  services:
    mon:        3 daemons, quorum pc-10-10-0-13,pc-10-10-0-89,pc-10-10-0-160
    mgr:        pc-10-10-0-89(active), standbys: pc-10-10-0-13, pc-10-10-0-160
    mds:        cephfs-1/1/1 up  {0=pc-10-10-0-160=up:active}, 2 up:standby
    osd:        5 osds: 5 up, 5 in
    rbd-mirror: 3 daemons active
    rgw:        3 daemons active

  data:
    pools:   6 pools, 68 pgs
    objects: 212 objects, 5.27KiB
    usage:   5.02GiB used, 12.7TiB / 12.7TiB avail
    pgs:     65 active+clean
             3  active+undersized

➜  ~ ceph osd tree
ID CLASS WEIGHT   TYPE NAME               STATUS REWEIGHT PRI-AFF
-1       12.73497 root default
-5        0.90959     host pc-10-10-0-13
 3   hdd  0.90959         osd.3               up  1.00000 1.00000
-7        0.90959     host pc-10-10-0-160
 4   hdd  0.90959         osd.4               up  1.00000 1.00000
-3       10.91579     host pc-10-10-0-89
 0   hdd  3.63860         osd.0               up  1.00000 1.00000
 1   hdd  3.63860         osd.1               up  1.00000 1.00000
 2   hdd  3.63860         osd.2               up  1.00000 1.00000
➜  ~ ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 24 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 24 flags hashpspool stripe_width 0 application cephfs
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 27 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 30 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 32 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 34 flags hashpspool stripe_width 0 application rgw

3 Answers


@itsafire This is not the solution. He is asking for a solution, not for a hardware recommendation.

I'm running multiple Ceph clusters with 8 nodes and 5 nodes. I always use 2 replicas with multiple CRUSH maps (for SSD, SAS and 7.2k drives).
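For what it's worth, on Luminous per-device-class rules can be set up roughly like this (the rule and pool names below are just examples, not taken from the cluster above):

    ceph osd crush rule create-replicated ssd_rule default host ssd
    ceph osd crush rule create-replicated hdd_rule default host hdd
    # point a pool at the SSD rule
    ceph osd pool set fast_pool crush_rule ssd_rule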

Why do you need 3 replicas if you are using a small cluster with limited resources?

Could you please explain why my solution is a "recipe for disaster"? You have a good reputation and I'm not sure how you got it. Maybe just by replying with recommendations, not solutions.

Asuk Nath
  • If you only have 2 copies and one drive fails, then you are left with only one copy of your precious data. If that drive fails during recovery, then you are fcked. Drive failure while doing recovery is quite a common incident resulting in data loss. – itsafire Oct 05 '18 at 11:24

Create a new pool with size 2 and min_size 1.
For pg_num, use the Ceph PG Calculator: https://ceph.com/pgcalc/
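Roughly like this, assuming a hypothetical pool name 'mypool' and a pg_num of 64 taken from the calculator:

    ceph osd pool create mypool 64 64 replicated
    ceph osd pool set mypool size 2
    ceph osd pool set mypool min_size 1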

Asuk Nath
  • Recipe for disaster. – itsafire Sep 27 '18 at 05:56
  • Could you please explain why my solution is Recipe for disaster? – Asuk Nath Oct 02 '18 at 14:06
  • Because if one drive fails and you are left with only one copy of your data, a subsequent failing drive will probably come with the cost of losing your employment. CERN even proposes that 3 replicas are not enough and is using 4 replicas instead. – itsafire Oct 05 '18 at 10:52
  • "If" and "if" is not a solution. If three drives failed, then you would also lose your employment, because your job is relying only on 3 replicas. The good thing about me is that I'm not going to lose my employment, because I own 3 datacenters. Ceph is running for customers' storage, and we are using NAS for backup at local and remote locations for DR. I'm not arguing with you about 2 vs 3 replicas. Your comment "recipe for disaster" is not acceptable. Ceph was using a default of 2. Also, what percentage of servers nowadays are still using RAID 5? Don't just go with if, if and if. – Asuk Nath Oct 06 '18 at 08:52

It seems you created a three-node cluster with different OSD configurations and sizes. The standard CRUSH rule tells Ceph to keep 3 copies of a PG on different hosts. If there is not enough space to spread the PGs over the three hosts, then your cluster will never be healthy.
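To confirm which PGs are stuck and where CRUSH tried to place them, something like the following should work on Luminous (replace 1.0 with one of the PG IDs reported by health detail):

    ceph health detail               # lists the 3 undersized PG IDs
    ceph pg dump_stuck undersized    # shows their up/acting OSD sets
    ceph pg 1.0 query                # detailed state of a single PG
    ceph osd crush rule dump         # confirm the rule's failure domain is 'host'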

It is always a good idea to start with a set of equally sized hosts (RAM, CPU, OSDs).

Update for the discussion about clusters with size 2 vs 3

Don't use 2 replicas. Go for 3. Ceph started out with a default size of 2, but this was changed to 3 with the Firefly release.

Why? Because if one drive fails, you are left with only one drive containing your data. Should that drive fail too while recovery is running, your data is gone for good.

See this thread on the ceph user mailing list

2 replicas isn't safe, no matter how big or small the cluster is. With disks becoming larger recovery times will grow. In that window you don't want to run on a single replica.
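If you do end up adjusting the replica count, it is a per-pool setting; for example, for the cephfs_data pool from the output above, checking and keeping 3 replicas would look roughly like this:

    ceph osd pool get cephfs_data size      # show the current replica count
    ceph osd pool set cephfs_data size 3
    ceph osd pool set cephfs_data min_size 2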

itsafire