
It is well known that you cannot use a non-cluster-aware filesystem such as ext4 on Linux with DRBD in dual-primary mode.

For example, as stated by Linbit in their manual "Dual Primary - think twice":

"DRBD replicates the changes from node A to node B and the other way around. 
It changes the contents of the physical storage device. However - as DRBD resides 
under the mentioned Ext4 filesystems, the filesystem on the physical disk of 
node A does not notice the changes coming from node B (and vice versa). 
This process is called a concurrent write. Starting from now, the actual content 
of the storage device differs from what the filesystem there thinks it should be. 
The filesystem is corrupt."

My question is - why is this?

Because if the metadata of that file system is stored on the same DRBD device, any change like the one described above would be synced between the two DRBD nodes as well, so the file systems on both ends (which consist of data plus metadata, don't they?) are fully in sync. True, what node 1 wrote has been overwritten by node 2, but if I issue a "dir" command on node 1, I would simply see a different file from the one node 1 just copied. The same happens on simple shared folders such as Windows CIFS shares, and it does not render the file system corrupt.

So where is the problem? Why is everyone saying the file system will be corrupt? Does it mean that ext4 does NOT store its metadata on the device itself but somewhere else, such as in the root file system? From what I have read about ext4 internals this is not the case (though I have to say I have not gone into great detail on ext4).

But it should be more or less like this:

Node1 writes a new file to block 34098 (and updates the directory entry as well):

Node 1
 - Directory Entry: /data/myfile1.txt  34098
 -----> block 34098 contains: myfile1.txt

At the "same time", Node2 writes the following to block 34098. It can never be exactly "at the same time", so we assume it happens just after DRBD has completed the above sync.

Node2
 - Directory Entry: /data/other.txt  34098
 -----> block 34098 contains: other.txt

DRBD should now sync block 34098 back to node1 again, along with the directory entry.

Along with writing the file "other.txt" to block 34098, the file system on node2 will also update the block containing the directory entry (which is just another file) pointing to block 34098. So it should always be in sync, shouldn't it?

nepdev

1 Answer


The kernel keeps an in-memory image of the state it believes the file system is in, and it does not re-check the disk to see whether that state has changed. It has no reason to: with a non-cluster-aware file system, only the local kernel is allowed to modify the file system, so it already knows every change that has been made. If you make changes from the second node, the on-disk structures will differ from what the first kernel expects, and data loss is nearly guaranteed.

Cluster-aware file systems add a lot of synchronization and checking to avoid all kinds of problems, so making e.g. ext4 cluster-capable is not as simple as having the kernel re-read the file system before every operation.
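
To make the stale-cache problem concrete, here is a toy simulation in Python. It is only a sketch and has nothing to do with how ext4 or DRBD are actually implemented: `SharedDisk`, `Node`, and the one-block-per-file layout are hypothetical names and simplifications invented for illustration. The one assumption that matters is that each node reads the metadata once at mount time and afterwards trusts its own in-memory copy.

```python
class SharedDisk:
    """Hypothetical stand-in for the replicated device both nodes see."""

    def __init__(self, n_blocks):
        self.blocks = {}                    # block number -> contents
        self.directory = {}                 # on-disk directory: name -> block
        self.free = set(range(n_blocks))    # on-disk free-block map


class Node:
    """Toy model of a kernel with a non-cluster-aware file system mounted."""

    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        # Metadata is read once at mount time and never re-read afterwards.
        self.cached_directory = dict(disk.directory)
        self.cached_free = set(disk.free)

    def write_file(self, filename, data):
        # Allocate from the *cached* free map, not from the disk.
        block = min(self.cached_free)
        self.cached_free.remove(block)
        self.cached_directory[filename] = block
        # Write the data, then flush metadata as this node believes it to be.
        self.disk.blocks[block] = data
        self.disk.directory = dict(self.cached_directory)
        self.disk.free = set(self.cached_free)
        return block

    def read_file(self, filename):
        # Look up the block in the cached directory, read data from disk.
        return self.disk.blocks[self.cached_directory[filename]]


disk = SharedDisk(n_blocks=4)
node1 = Node("node1", disk)
node2 = Node("node2", disk)     # mounted before node1 writes anything

b1 = node1.write_file("myfile1.txt", "contents of myfile1")
b2 = node2.write_file("other.txt", "contents of other")

print("node1 allocated block", b1, "and node2 allocated block", b2)
print("node1 reads myfile1.txt:", node1.read_file("myfile1.txt"))
print("on-disk directory:", disk.directory)
```

In this model both nodes pick the same "free" block, node1 still believes myfile1.txt exists but reads node2's data through it, and the directory on disk no longer lists myfile1.txt at all. The replication itself worked perfectly; the inconsistency is between the disk and what each node holds in memory.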

Sven
  • To put this in perspective, the behavior Sven describes applies to any non-cluster file system. There is a separate breed of file systems designed for high-availability applications with exactly this condition in mind; they are called "shared-disk clustered file systems". See http://en.wikipedia.org/wiki/Clustered_file_system for details. – the-wabbit Nov 28 '13 at 15:10
  • Thanks SvW, this explains it. As a consequence, even in an active/standby setup you would have to re-mount the file system at the moment the primary node fails and the secondary node has to take over, because the kernel on the standby would not have an up-to-date picture of what is in the file system, right? – nepdev Nov 28 '13 at 15:14
  • The problem is partly due to locking, or rather the lack of cluster-aware locking. App1 on node1 isn't aware that App2 on node2 has opened the same file. If both write modifications, which one wins is unpredictable, and that makes for corruption as well. – HBruijn Nov 28 '13 at 15:15
  • Just edited - see above – nepdev Nov 28 '13 at 15:20
  • You can't mount a resource in DRBD secondary state anyway, and certainly not in read-write mode. Your fail-over script has to make sure that the resource is switched to primary on the surviving node and mounted afterwards. And you have to make sure your STONITH works. – Sven Nov 28 '13 at 15:52