
I have two PCs running Linux, each with a 2 TB disk, and one small gigabit switch. To build a highly available system on the cheap, I resorted to this stack (a rough sketch of the setup follows the list):

  1. custom 5.6 kernel with ZFS and DRBD9 on both PCs.
  2. one zvol in a partition of each PC's local disk - compression enabled, dedup disabled (I tried to enable dedup, but everything hung badly)
  3. dual primary DRBD9 to mirror between them
  4. OCFS2 on top, to mount the resulting device on both PCs
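Roughly, this is what the setup looks like on each node; the pool/resource names, hostnames, IPs and sizes below are placeholders, and the initial DRBD sync/promotion steps are abbreviated:

    # ZFS pool and zvol on the local disk partition (compression on, dedup off)
    zpool create tank /dev/sda2
    zfs create -V 1800G -o compression=lz4 -o dedup=off tank/drbd0

    # Minimal DRBD9 resource definition, /etc/drbd.d/r0.res (same on both nodes)
    cat > /etc/drbd.d/r0.res <<'EOF'
    resource r0 {
        device    /dev/drbd0;
        disk      /dev/zvol/tank/drbd0;
        meta-disk internal;
        net { allow-two-primaries yes; }
        on pc1 { node-id 0; address 192.168.10.1:7789; }
        on pc2 { node-id 1; address 192.168.10.2:7789; }
    }
    EOF
    drbdadm create-md r0 && drbdadm up r0
    drbdadm primary r0    # on both nodes, once the initial sync has completed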

A third, very old machine acts as a DRBD arbitrator, with no actual disk space participating in the DRBD mirror.

A second switch and a second NIC are on the way to improve availability.

I would like to understand whether there is a simpler stack to achieve the same result. I have already discarded some options based on my current knowledge: Lustre (too complex for small environments), BeeGFS (not updated recently), GlusterFS (does not work with raw devices, only with mounted folders).

EDIT - I've been asked to focus on one question. As the first one has been answered, I kept the second.

Qippur
  • Zvol on drbd is probably not a great idea as zfs wants access to the raw disk for best reliability. You must be taking a significant performance hit using drbd with dual primary on gigabit switches. – davidgo Jun 15 '20 at 10:02
  • At a block level, have you considered MARS (like drbd+drbd proxy, but free - unfortunately not great kernel support)? Also, ZFS replication. Also probably not what you want, but have you looked at MooseFS? – davidgo Jun 15 '20 at 10:06
  • Yes, speed is not the best, but 10G is out of budget. A second gigabit link is the maximum I can afford. MooseFS looks as complex as Lustre in terms of roles to set up and configure. And yes, ZFS replication is not the correct tool for the job. – Qippur Jun 15 '20 at 10:24

1 Answer


You are conflating cluster filesystems with distributed filesystems.

What you achieved with your ZVOL+DRBD+OCFS2 setup is a "shared-nothing" clustered filesystem, where DRBD emulates a true shared-block SAN and OCFS2 (or GFS2) provides multiple concurrent mounts by multiple head nodes. In this configuration you cannot swap layers 2 and 3 (i.e. DRBD+ZVOL+OCFS2) because ZFS is not a cluster filesystem - if mounted on two different hosts, it will very quickly corrupt itself (this is true even for ZVOLs, which are little more than hidden files in the root ZFS dataset).
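As a minimal sketch of that layering (assuming the DRBD resource is already in dual-primary mode and the o2cb cluster stack is configured; device, label and mountpoint names are examples only):

    # Run once, from either node, while the DRBD device is Primary on both
    mkfs.ocfs2 -N 2 -L shared /dev/drbd0    # -N 2: two node slots

    # Then on BOTH nodes (needs /etc/ocfs2/cluster.conf and the o2cb service,
    # omitted here for brevity)
    mount -t ocfs2 /dev/drbd0 /mnt/shared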

Lustre, Gluster, Ceph, etc. are distributed filesystems: they use individual filesystems/files/databases on each host, which are combined at the userspace level into a single, multiple-host-spanning (i.e. distributed) filesystem.

How can you select between the two approaches? It depends on multiple factors:

  • if cold, async replication is sufficient, you can use zfs send/recv and call it a day (first sketch after this list)

  • if true realtime replication is required, but no hard/immediate HA is needed and manual failover is an option, you can use DRBD in single-primary mode and completely skip the overhead of a cluster filesystem, i.e. using plain XFS rather than OCFS2/GFS2 (second sketch below)

  • if used as a big-file store (e.g. VM images) and with only a handful of hosts, your current approach is probably the best one (at the cost of added complexity and reduced performance). With many nodes, GlusterFS (with the right options - sharding being the first one) can be a reasonable choice, but be sure to follow the mailing list, as it has many gotchas (third sketch below)

  • if you need a "large NAS" for storing many medium-sized files (1-128 MB), GlusterFS in replica mode can be the right choice (again, be sure to follow the mailing list)

  • if you have many nodes and large sysadmin resources (read: a dedicated team), you can consider Lustre or Ceph, which are the higher-end options among distributed filesystems.
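For the first option, a zfs send/recv sketch (dataset and host names are hypothetical):

    # Initial full copy from pc1 to pc2
    zfs snapshot tank/data@rep1
    zfs send tank/data@rep1 | ssh pc2 zfs recv -F tank/data

    # Periodic incremental updates
    zfs snapshot tank/data@rep2
    zfs send -i @rep1 tank/data@rep2 | ssh pc2 zfs recv tank/data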
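For the second option (single-primary DRBD plus a plain filesystem), failover becomes a manual demote/promote/mount sequence; again, device and mountpoint names are examples:

    # One-time: format the DRBD device with a plain, non-cluster filesystem
    mkfs.xfs /dev/drbd0

    # Normal operation: mounted on the current primary only
    drbdadm primary r0 && mount /dev/drbd0 /srv/data

    # Manual failover: demote the old primary (if still reachable), promote the peer
    umount /srv/data && drbdadm secondary r0            # on the old primary
    drbdadm primary r0 && mount /dev/drbd0 /srv/data    # on the surviving node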
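For the GlusterFS cases, a replica volume with sharding enabled looks roughly like this (brick paths, volume and host names are examples; note that a plain replica-2 volume is split-brain prone, so an arbiter brick or replica 3 is usually recommended):

    # Bricks live on already-mounted local filesystems on each node
    gluster peer probe pc2
    gluster volume create gvol replica 2 pc1:/bricks/b1 pc2:/bricks/b1
    gluster volume set gvol features.shard on   # sharding helps with big files/VM images
    gluster volume start gvol

    # Clients mount it through the FUSE client
    mount -t glusterfs pc1:/gvol /mnt/gvol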

I strongly advise you to keep things as simple as possible, even at the price of reduced availability (unless you really need it): storage administration is a complex task which requires a profound understanding of all the moving parts to avoid burning yourself (and eating your data).

NOTE: you can read here for complementary information.

shodanshok
  • Actual goal is to scale out some services. MySQL, Redis, and similar software implement the option natively. PostgreSQL has a more cumbersome approach. Apache can be balanced easily with itself or with HAProxy and keepalived. The filesystem is the last bit. I really don't like (personal taste) GlusterFS's need for an existing mounted folder, which could be accessed outside of Gluster's knowledge by mistake. – Qippur Jun 15 '20 at 11:14
  • 1
    @Qippur MySQL, redis and the likes have their own cluster engines; I would use them, rather than running a database on a cluster or distributed filesystem. – shodanshok Jun 15 '20 at 12:44
  • Yes, this is what I am doing for each service that has a native clustering/high-availability feature. I likely did not state it clearly in my previous post. The only missing bit to settle is the filesystem. – Qippur Jun 15 '20 at 13:14