I have been experiencing data corruption when writing to replicated GlusterFS volumes that I have configured across two servers.
The configuration I have set up is as follows:
- Servers are running Ubuntu 16.04 and GlusterFS v3.10.6
- Clients are running Ubuntu 14.04 and GlusterFS v3.10.6
- Two GlusterFS volumes have been configured, each with two bricks in a replica 2 layout, one brick on each server (see the sketch below this list).
- Each brick is an MDADM RAID5 array with an EXT4 file system on top of LUKS.
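For reference, the volumes were created along roughly these lines; the hostnames, volume names and brick paths below are placeholders rather than the exact ones in use:

    # Replica 2 volume: one brick per server (names are illustrative)
    gluster volume create vol1 replica 2 \
        server1:/data/brick1/vol1 \
        server2:/data/brick1/vol1
    gluster volume start vol1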
Each volume is configured with the default options, plus bitrot detection. These are as follows:
    features.scrub: Active
    features.bitrot: on
    features.inode-quota: on
    features.quota: on
    nfs.disable: on
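The bitrot and quota features were enabled through the standard gluster CLI, i.e. something like the following (volume name again a placeholder):

    gluster volume bitrot vol1 enable   # turns on bitrot detection and scrubbing
    gluster volume quota vol1 enable    # enables the quota features listed above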
The data corruption manifests itself when large directories are copied from the local file system of one of the client machines to either of the configured GlusterFS volumes. When MD5 checksums are calculated for the copied files and the source files and the two are compared, a number of the checksums differ.
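The comparison is done along these lines (the paths here are illustrative):

    # Checksum the source tree on the client's local disk, then verify the
    # copy on the GlusterFS mount against those checksums
    cd /data/source && find . -type f -exec md5sum {} + > /tmp/source.md5
    cd /mnt/gluster-vol1/source && md5sum -c /tmp/source.md5 | grep -v ': OK$'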
Manually triggering a self-heal on either GlusterFS volume shows no files identified for healing. Additionally, neither the output of gluster volume bitrot <volname> scrub status nor the logs in /var/log/glusterfs/bitd.log and /var/log/glusterfs/scrub.log identify any errors.
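Concretely, the checks I have been running are roughly the following (volume name is a placeholder, and the exact heal invocation may differ slightly):

    gluster volume heal vol1 full              # trigger a self-heal
    gluster volume heal vol1 info              # shows no entries needing heal
    gluster volume bitrot vol1 scrub status    # reports no corrupted objects
    grep -iE 'error|corrupt' /var/log/glusterfs/bitd.log /var/log/glusterfs/scrub.log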
These issues have only manifested themselves recently, after around a week of both volumes being used fairly heavily by ~10 clients.
I have tried taking the volumes offline and writing data to each of the bricks directly via the underlying local file system, and have not been able to reproduce the issue.
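That test was roughly of the following form, run on each server against the file system backing the brick (paths are illustrative):

    # With the volume stopped, copy a test tree onto the brick's underlying
    # file system and verify its checksums there
    gluster volume stop vol1            # answer the confirmation prompt
    cp -r /root/testdata /data/brick1/direct-test
    cd /root/testdata && find . -type f -exec md5sum {} + > /tmp/testdata.md5
    cd /data/brick1/direct-test && md5sum -c /tmp/testdata.md5 | grep -v ': OK$'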
To debug the issue further, I have configured a similar setup on VMs in VirtualBox and have not been able to reproduce the problem there either. I am therefore at rather a loss as to what the cause of these errors may be.
Any suggestions for further debugging steps I could take, or known issues with GlusterFS and my configuration, would be appreciated.