
I am considering Ceph as the distributed filesystem for my home-made MAID (massive array of idle drives).

As far as I understand, Ceph is oriented towards cluster use: it spreads data evenly over OSDs (with respect to CRUSH maps) and tries to parallelize read operations across different nodes.

In my case I don't need to maximize spread and throughput; ideally it should fill the first N OSDs (where N is the replication factor) and only then start filling the next N OSDs, to minimize the number of drives that have to be active to retrieve adjacent data.

Can I somehow achieve such behaviour by tweaking the placement group count and CRUSH maps? Or, if that is not possible, can I at least make Ceph stop splitting files into more than one block?
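For illustration, the closest thing I can picture is a CRUSH rule pinned to a single host bucket so that a pool only uses that host's OSDs (the bucket, rule and pool names are hypothetical, and I have not verified that this generalizes to "fill N OSDs, then the next N"):

# hypothetical: restrict a pool to the OSDs under the CRUSH bucket "host1"
ceph osd crush rule create-simple pin-to-host1 host1 osd
ceph osd crush rule dump                      # look up the new rule's ruleset id
ceph osd pool set mypool crush_ruleset <ruleset-id>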

gordon-quad
  • From my experience, I don't think Ceph can do this (I never actually tried, but I can't imagine how you would achieve it). It was designed as a distributed object store, not something else. As far as I can tell, you are not really looking for something distributed? I would point you to a much simpler solution such as thinly provisioned LVM, which does roughly what you are looking for (a rough sketch follows below these comments). – Florin Asăvoaie Nov 25 '15 at 23:04
  • I'm looking for something distributed because the HDDs will be on different machines in the network. I could use LVM over NBD, but it is not as flexible as Ceph storage: I would still have to put some filesystem on top of the LVM block device, so adding or removing HDDs from the system would lead to a mess with filesystem resizing. – gordon-quad Nov 26 '15 at 10:41
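A rough sketch of the thinly provisioned LVM idea mentioned in the comments, assuming an existing volume group named vg0 (all names and sizes are illustrative):

# create a thin pool inside the volume group, then a thin volume that is
# larger than the space currently backing it; blocks are allocated on write
lvcreate -L 1T --thinpool pool0 vg0
lvcreate -V 4T --thin -n maid vg0/pool0
mkfs.ext4 /dev/vg0/maid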

1 Answer


I don't think what you want to achieve is possible with Ceph. As far as I understand, Ceph is a distributed file system that ensures high fault tolerance through replication. Read here:

Ceph aims primarily to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available.

Ceph's power is its scalability and high availability:

Scalability and High Availability

In traditional architectures, clients talk to a centralized component (e.g., a gateway, broker, API, facade, etc.), which acts as a single point of entry to a complex subsystem. This imposes a limit to both performance and scalability, while introducing a single point of failure (i.e., if the centralized component goes down, the whole system goes down, too).

Ceph eliminates the centralized gateway to enable clients to interact with Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster of monitors to ensure high availability. To eliminate centralization, Ceph uses an algorithm called CRUSH.

What I'm trying to point out is that Ceph is made to manage physical disk usage in a cluster environment in a way that ensures resilience, high availability and transparency. Not quite what you are looking for.

If you are worried about performance or disk I/O, there is an option called Primary Affinity, which can be employed, for example, to prioritize SAS disks over SATA. Read more here and here.

Primary Affinity

When a Ceph Client reads or writes data, it always contacts the primary OSD in the acting set. For set [2, 3, 4], osd.2 is the primary. Sometimes an OSD isn’t well suited to act as a primary compared to other OSDs (e.g., it has a slow disk or a slow controller). To prevent performance bottlenecks (especially on read operations) while maximizing utilization of your hardware, you can set a Ceph OSD’s primary affinity so that CRUSH is less likely to use the OSD as a primary in an acting set.

ceph osd primary-affinity <osd-id> <weight>

Primary affinity is 1 by default (i.e., an OSD may act as a primary). You may set the OSD primary range from 0-1, where 0 means that the OSD may NOT be used as a primary and 1 means that an OSD may be used as a primary. When the weight is < 1, it is less likely that CRUSH will select the Ceph OSD Daemon to act as a primary.
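For example (the OSD ids and weights below are illustrative; on older releases you may also need mon osd allow primary affinity = true in ceph.conf before the command takes effect):

# make osd.2 less likely to be chosen as the primary in its acting sets
ceph osd primary-affinity osd.2 0.5
# prevent osd.5 from acting as a primary at all
ceph osd primary-affinity osd.5 0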

I know this doesn't exactly answer all your questions, but it may provide some food for thought.

See details here: http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity

And here is a nice blog post explaining the Ceph cluster.

Diamond