
I am currently building a CEPH cluster for a KVM platform, and the performance right now is catastrophic; the numbers are dreadful. I am not really familiar with physically distributed systems, so is there any general advice for improving the overall performance (i.e. latency, bandwidth and IOPS)?

The hardware configuration is not optimal right now, but I would still like to get the full potential out of what I currently have:

1x 10GbE Huawei switch

3x rack servers, each with the following hardware configuration:

Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz x2, 48 logical cores in total

128GB DDR3 RAM

Intel 1.84TB NVMe SSD x6 as data drives, with 1 OSD per disk (6 OSDs per server in total)

My current /etc/ceph/ceph.conf:

[global]
fsid = f2d6d3a7-0e61-4768-b3f5-b19dd2d8b657
mon initial members = ceph-node1, ceph-node2, ceph-node3
mon allow pool delete = true
mon host = 192.168.16.1, 192.168.16.2, 192.168.16.3
public network = 192.168.16.0/24
cluster network = 192.168.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 600
osd pool default pgp num = 600
osd memory target = 4294967296
max open files = 131072

[mon]
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600

[OSD]
osd journal size = 20000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000

[Client]
rbd cache = True
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = False
rbd cache max dirty object = 2
rbd cache target dirty = 235544320

The IO benchmark was done with fio, using the following command: fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randread -size=100G -filename=/data/testfile -name="CEPH Test" -iodepth=8 -runtime=30
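The same run expressed as a fio job file, for readability (equivalent to the command line above; the job is renamed to ceph-test only to avoid the space in the name):

[global]
ioengine=libaio
direct=1
thread
bs=4k
iodepth=8
size=100G
runtime=30

[ceph-test]
rw=randread
filename=/data/testfile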

Benchmark result: (screenshot not reproduced here)

The benchmark was run on a separate machine, configured to connect to the cluster via the 10GbE switch, with only the MDS installed. The benchmark machine is identical to the three that form the cluster, apart from the absence of the Intel NVMe SSD drives.

Any help is appreciated,

Frank_HCY
  • This is not an openstack issue. Openstack is (at most) only peripherally involved in any of the potential causes of your performance issues. – Stephen C Jul 04 '20 at 03:52
  • It would be nice if you could provide some feedback here on how you managed to increase the performance. As it is now, your question doesn't serve to provide much value. – Lifeboy Jul 30 '23 at 11:43

3 Answers

2

First, I must note that Ceph is not an acronym; it is short for Cephalopod, because tentacles.

That said, you have a number of settings in ceph.conf that surprise me, like the extreme number of osdmaps you're caching. The thread settings can be tricky and vary in applicability between releases. Building pools with 600 PGs isn't great: you generally want a power of 2, and a per-OSD ratio that factors in drive type and other pools. Setting the mon clock drift allowance to a full second (vs. the default 50 ms) is downright alarming; with Chrony, or even the legacy ntpd, it's not hard to get sub-millisecond syncing.
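As a rough sketch of the usual rule of thumb (assuming a single replicated pool spanning all 18 OSDs and the common target of about 100 PGs per OSD):

18 OSDs x 100 PGs per OSD / 3 replicas = 600, and the nearest power of two is 512

So pg_num = 512 (with pgp_num to match) would be the more conventional choice than 600.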

Three nodes may limit the degree of parallelism / overlap clients can exploit, especially since you only have 6 drives per server. That's only 18 OSDs.

You have Filestore settings in there too; you aren't really using Filestore, are you? Or a Ceph release older than Nautilus?

Finally, as more of an actual answer to the question posed, one simple thing you can do is to split each NVMe drive into two OSDs -- with appropriate pgp_num and pg_num settings for the pool.

ceph-volume lvm batch --osds-per-device 2
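For example, something along these lines on each node, after draining and destroying its existing OSDs (a sketch only; the device paths and the pool name "rbd" are placeholders, and the pg_num value assumes 36 OSDs with the same ~100 PGs-per-OSD rule of thumb as above):

# recreate the OSDs with two per NVMe device (device paths are examples)
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1

# then raise the pool's PG count to keep a sensible per-OSD ratio ("rbd" is a placeholder pool name)
ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024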

– anthonyeleven
0

I assume you meant 3 server blades and not 3 racks.

What was your rough estimate of expected performance? What is the performance profile of your disk hardware (outside Ceph) at 4K and 2MB? How many disks do you have in this pool, and what are the replication factor/strategy and object size?
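For the raw-drive baseline, something like the following run directly against one NVMe device would do (a sketch only; the device path is an example, and these are read-only tests, so they are safe even on a disk that already hosts an OSD):

# 4K random read baseline, direct I/O, bypassing Ceph entirely
fio --name=raw-4k --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32 --runtime=30

# 2M sequential read baseline
fio --name=raw-2m --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=read --bs=2M --iodepth=8 --runtime=30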

On the client side you are performing small reads: 4K. On the server side, depending on your read-ahead settings and object size, each of these 4K reads may pull in much more data in the background.

Did you check whether one of your disks is really at its limits, and that there is no network/CPU throttling?
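As a quick check while the benchmark is running, the standard sysstat tools on each OSD node will show whether a disk, a CPU core or the NIC is the bottleneck (a sketch; package names vary by distribution):

# per-device utilisation and latency
iostat -x 1

# per-core CPU load (a single saturated core can throttle an OSD)
mpstat -P ALL 1

# throughput on the 10GbE interface
sar -n DEV 1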

0

You can partition your drives with LVM and use multiple OSDs per drive. Since you have so many cores per server, one OSD per drive is not making full use of them.
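A rough sketch of the manual LVM route for one drive (device and volume names are examples; ceph-volume lvm batch --osds-per-device 2, as mentioned in another answer, achieves the same result with less typing):

# split one NVMe device into two equal logical volumes
pvcreate /dev/nvme0n1
vgcreate ceph-nvme0 /dev/nvme0n1
lvcreate -l 50%VG -n osd-a ceph-nvme0
lvcreate -l 100%FREE -n osd-b ceph-nvme0

# create one OSD on each logical volume
ceph-volume lvm create --data ceph-nvme0/osd-a
ceph-volume lvm create --data ceph-nvme0/osd-b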