
I have been using a suite of filesystem testing tools to benchmark and abuse a GlusterFS volume. The volume is a replica 3 volume spread out over 6 hosts.

fio, IOzone, and Bonnie++ indicate that Gluster is working just fine: bandwidth is roughly equal to that of the client and server network adapters, so raw performance can't realistically be improved much further. Most of my test cases operated on 32 GB files, apart from the IOzone and Bonnie++ runs.

I have gotten reports of split brain occurring on certain files that are concurrently written to by multiple clients. All of the documentation I have read indicates that split brain largely occurs during network partitions, which is clearly not the case here, judging from the logs.

Unfortunately, this split brain seems to occur only when using a certain hosted service, and I have zero introspection into how that service operates, what version of Gluster client it has, etc. The servers are running the latest 4.0 release.

Judging from the failure case I have been presented with ("split brain happens when two containers are writing to the same file at the same time"), I need a test that will reproduce a similar situation.

I could definitely write my own test case in C or Rust, but is there something out there which will test this exact case without having to write anything?

I do have access to (but no introspection into) this hosted service, so I will probably test through it as well. I'm also scratching my head at the actual problem: what is the desired outcome when two programs write different data to the same file at the same time?


EDIT: The servers are running the latest CentOS 7 release. My testing client server is also running the same. The underlying filesystem is XFS.

Is there a specific test case that I can use to try to recreate the problem?

Naftuli Kay
  • Which operating system and file system are you using? – John Mahowald Jul 02 '18 at 02:23
  • Do the gluster bricks not agree, or does the application not do file locking correctly? The former might be visible from `gluster volume heal info split-brain`. – John Mahowald Jul 02 '18 at 02:53
  • @JohnMahowald thanks for your reply, I have updated the question with more information. Unfortunately, I do not have access to the code causing the problem, but it's a PHP script writing log files so it's probably not doing any locking, it's probably just `fopen` in append, `fwrite`, then `fclose`. Apparently no problems are had on regular NFS. If you have any ideas on how I could write a test to reproduce this, I'm all ears :) – Naftuli Kay Jul 02 '18 at 16:47
  • @JohnMahowald after consulting PHP engineers and the source code, it appears that `error_log` does not `flock(2)`, which is likely the cause of this issue. It doesn't really manifest on NFS because NFS will just allow the corruption whereas GlusterFS is likely trying to prevent it. Without filesystem locking, I can't see how GlusterFS would do the "right" thing in this circumstance. – Naftuli Kay Jul 03 '18 at 20:44

2 Answers


Sounds like you have a PHP app whose error log is getting corrupted. So the most realistic test would be to spawn multiple PHP processes that call error_log() in parallel.

You could strace the app while it writes the error log, or read the source code, to find out the precise implementation. Particularly interesting is whether it opens the file in append mode with O_APPEND. Appends have race conditions on NFS, so O_APPEND alone does not necessarily make concurrent writes safe on network file systems.

Consider switching error_log to syslog and letting your local syslogd forward to a central syslog server instead. That reduces the log to a single file writer. Or you can forward to a log analytics platform such as Graylog, ELK, or Splunk, which store events in a proper database.

John Mahowald
  • I looked into it and the error log in PHP does not perform file locking, rather the FPM process will use a mutex to ensure only one thing is writing at a time. I will try to produce an example in C to reproduce this from two client nodes. It's likely that NFS just fails softly while Gluster is trying to preserve data integrity at all costs. – Naftuli Kay Jul 04 '18 at 21:29

Just create two separate fio jobs doing direct I/O to the same file, which is set with the filename parameter. Keep the file somewhat small, have one or both jobs write at random offsets, and perhaps give each job a different blocksize. Bonus points for using fio's client/server mode so the jobs come from different machines. Use runtime and time_based to keep fio looping.
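A sketch of such a job file (the path, size, and runtime below are assumptions, not values from the question):

```ini
; two concurrent writers hammering the same file with direct I/O
[global]
filename=/mnt/gluster/raceme.dat   ; assumed Gluster mount path
size=64m
direct=1
time_based=1
runtime=60

[writer-seq]
rw=write
bs=4k

[writer-rand]
rw=randwrite
bs=64k
```

fio runs all jobs in a file concurrently by default. For the multi-machine variant, start `fio --server` on each client host and drive them from a third box, roughly `fio --client=<host1> race.fio --client=<host2> race.fio`.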

Anon
  • 1,245
  • 10
  • 23