
I would like to understand the best solution for real-time replication between two ZFS on Linux (ZoL) boxes connected by a 10 GbE link. The goal is to use them for virtual machines; only one box at a time will run the virtual machines and the ZFS filesystem itself. Snapshots need to be possible on the first (active) box. I plan to use enterprise/nearline-grade SATA disks, so dual-port SAS disks are out of the question.

I have considered the following possibilities:

  • use iSCSI to export the remote disks and make a mirror between the local box's ZFS disks and the remote iSCSI disks. The main appeal of this solution is its simplicity, as it uses ZFS's own mirroring. On the other hand, ZFS will not give priority to the local disks over the remote ones, which can cause some performance degradation (barely relevant on a 10 GbE network, I suppose). A bigger concern is how ZFS will behave if the network link between the two boxes is lost. Will it re-sync the array when the remote machine becomes available again, or will manual intervention be required?
  • use DRBD to synchronize two ZVOLs and lay ZFS on top of the DRBD device. In other words, I'm talking about a stacked ZVOL + DRBD + ZFS solution. This seems the preferable approach to me, as DRBD 8.4 is very stable and proven. However, many I/O layers are at play here and performance may suffer.
  • use plain ZFS with GlusterFS on top. From the ZFS standpoint, this is the simplest solution, as all replication traffic is delegated to GlusterFS. Have you found GlusterFS stable enough?
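To make the first option concrete: with ZFS-level mirroring across the two boxes, each mirror vdev should pair one local disk with one iSCSI-attached remote disk, so that losing the link degrades every vdev instead of faulting any of them. The Python sketch below only builds the `zpool create` argument list and runs nothing; the pool name `tank` and all device paths are hypothetical placeholders.

```python
from typing import List, Sequence


def build_mirror_create(pool: str, local: Sequence[str], remote: Sequence[str]) -> List[str]:
    """Build a `zpool create` command in which every mirror vdev
    pairs one local disk with one remote (iSCSI-attached) disk."""
    if len(local) != len(remote):
        raise ValueError("need the same number of local and remote disks")
    if not local:
        raise ValueError("need at least one disk pair")
    cmd = ["zpool", "create", pool]
    for loc, rem in zip(local, remote):
        # Each vdev is a two-way mirror with one leg on each box,
        # so a link failure leaves every vdev degraded but online.
        cmd += ["mirror", loc, rem]
    return cmd


if __name__ == "__main__":
    # Hypothetical device paths; stable /dev/disk/by-id or by-path names
    # are preferable to /dev/sdX, which can change across reboots.
    local = ["/dev/disk/by-id/ata-LOCAL1", "/dev/disk/by-id/ata-LOCAL2"]
    remote = ["/dev/disk/by-path/ip-REMOTE-iscsi-lun-1", "/dev/disk/by-path/ip-REMOTE-iscsi-lun-2"]
    print(" ".join(build_mirror_create("tank", local, remote)))
```

Pairing the disks this way is the design choice that matters: a vdev made of two remote disks would go offline with the link and take the whole pool with it.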

What do you feel is the better approach? Thanks.

shodanshok
  • Are you certain you need realtime replication? What problem does it solve? For the use case you mentioned, I either use a clustered dual-node shared SAS setup or asynchronous replication on 15 or 30-second intervals. – ewwhite Apr 24 '17 at 15:53
  • The goal is to have a perfectly synchronized backup/standby machine. Surely I can use very frequent (i.e. every 60 seconds) snapshots and send/receive, and it would be very reasonable, but real-time replication would be even better (and I already have some such installations using DRBD + LVM rather than ZFS). In order to use enterprise/nearline-grade SATA disks, I would prefer not to depend on dual-port SAS disks (although that is a good solution in itself). I'll add this information to the main post. – shodanshok Apr 24 '17 at 16:57



I recommend a clustered dual-node shared SAS setup or continuous asynchronous replication on 15- or 30-second intervals. The former is good for continuity, while the latter provides a way to obtain geographic separation. They can be used together.
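The asynchronous option can be sketched as one cycle of snapshot plus incremental send/receive, repeated every 15 or 30 seconds. This Python sketch only builds the shell commands; the dataset `tank/vms`, the standby host `node2`, the `repl-<epoch>` snapshot naming and the use of `ssh` as transport are all assumptions. It also assumes the dataset does not yet exist on the standby for the first, full send, and it omits pruning of old snapshots and error handling.

```python
import shlex
from typing import List, Optional


def snapshot_name(dataset: str, epoch: int) -> str:
    # Hypothetical naming scheme: <dataset>@repl-<unix seconds>
    return f"{dataset}@repl-{epoch}"


def replication_step(dataset: str, remote_host: str,
                     prev_snap: Optional[str], epoch: int) -> List[str]:
    """Return the shell commands for one replication cycle: take a new
    snapshot, then send it to the standby over ssh, incrementally
    if a previous snapshot exists."""
    new_snap = snapshot_name(dataset, epoch)
    send = ["zfs", "send"]
    if prev_snap is not None:
        # Incremental stream: only blocks changed since prev_snap.
        send += ["-i", prev_snap]
    send.append(new_snap)
    # -F rolls the standby dataset back to its most recent snapshot
    # before receiving, discarding any local changes made there.
    recv = ["ssh", remote_host, "zfs", "recv", "-F", dataset]
    pipeline = (" ".join(shlex.quote(a) for a in send)
                + " | " + " ".join(shlex.quote(a) for a in recv))
    snap_cmd = " ".join(["zfs", "snapshot", shlex.quote(new_snap)])
    return [snap_cmd, pipeline]


if __name__ == "__main__":
    prev = None
    for epoch in (1000, 1030):  # two cycles, 30 seconds apart
        for cmd in replication_step("tank/vms", "node2", prev, epoch):
            print(cmd)
        prev = snapshot_name("tank/vms", epoch)
```

The previous snapshot is carried from one cycle to the next so that each send after the first is incremental; in practice a scheduler (cron, a systemd timer, or a loop with `time.sleep`) would drive the cycles.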

However, if you want to experiment, you can use Infiniband SRP or 100GbE RDMA to create a ZFS mirror between your two nodes.

For example, node1 and node2 each have local disks (assume hardware RAID) and present that local storage over SRP. Only one node controls the zpool at a time, and that pool is composed of node1's local disks and node2's remote disk.

Your mirroring is synchronous because it's a ZFS mirror. Failover and consistency are handled by normal resilvering behavior. Zpool import/ownership/export is handled by Pacemaker and the standard cluster utilities.
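The pool handoff that Pacemaker automates boils down to moving pool ownership from one node to the other. The hypothetical Python sketch below lists the ordered steps for illustration only; in production the cluster resource manager should perform them, and the node and pool names are placeholders. The key safety point is that after a crash the old owner must be fenced before the survivor force-imports the pool; otherwise both nodes could end up writing to the pool at the same time (split-brain).

```python
from typing import List, Tuple

# Each step is (node that runs it, command). Node names are hypothetical.
Step = Tuple[str, str]


def handoff_steps(pool: str, old_node: str, new_node: str,
                  old_node_alive: bool) -> List[Step]:
    """Ordered steps to move ownership of `pool` from old_node to new_node.

    Planned switchover: cleanly export on the old owner, then import.
    Unplanned takeover: the old owner is assumed to be already fenced
    (e.g. by Pacemaker STONITH); the import needs -f because the pool
    was never exported.
    """
    steps: List[Step] = []
    if old_node_alive:
        steps.append((old_node, f"zpool export {pool}"))
        steps.append((new_node, f"zpool import {pool}"))
    else:
        steps.append((new_node, f"zpool import -f {pool}"))
    # Check mirror health: the other node's leg stays unavailable
    # until it returns and is resilvered.
    steps.append((new_node, f"zpool status {pool}"))
    return steps


if __name__ == "__main__":
    for node, cmd in handoff_steps("tank", "node1", "node2", old_node_alive=False):
        print(f"[{node}] {cmd}")
```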

Or you can use a commercial solution that does the same. See:

http://www.zeta.systems/blog/2016/10/11/High-Availability-Storage-On-Dell-PowerEdge-&-HP-ProLiant/

ewwhite
  • Very interesting, thank you. Do you have something similar in production? Can you comment on manageability and reliability? Thanks. – shodanshok Apr 24 '17 at 18:38
  • I recommend shared storage enclosures or having replication. I don't think synchronous mirroring solves every problem, so it's not my preference. You can talk to the guys at Zeta Systems, as they developed the solution. – ewwhite Apr 24 '17 at 18:44
  • "The latter is good for ..., while the latter ...". I presume you meant to say "former" in one of those instances? – Kenny Rasschaert Dec 19 '18 at 10:54
  • @KennyRasschaert Probably. – ewwhite Dec 19 '18 at 14:03