I have huge performance issues using MongoDB (I believe it is an mmapped DB) with ZFSonLinux.
Our MongoDB does almost only writes. On replicas without ZFS, the disk is completely busy in ~5 s spikes when the app writes into the DB every 30 s, with no disk activity in between, so I take that as the baseline behaviour to compare against.
On replicas with ZFS, the disk is completely busy all the time, and those replicas are struggling to keep up to date with the MongoDB primary. I have lz4 compression enabled on all replicas, and the space savings are great, so much less data should be hitting the disk.
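(For what it's worth, the savings can be checked per dataset with something like the following; dataset names are the ones from the zfs list output below:)
# achieved compression ratio and space used per dataset
zfs get compressratio,used,referenced zfs/mongo_data-rum_a zfs/mongo_data-rum_old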
So on these ZFS servers, I first had the default recordsize=128k. Then I wiped the data and set recordsize=8k before resyncing the Mongo data. Then I wiped again and tried recordsize=1k. I also tried recordsize=8k without checksums.
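Roughly, each attempt went along these lines (a sketch; the dataset name is the one shown in the zfs list output below). Since recordsize only applies to data written after the change, the wipe and resync was needed each time:
# recordsize only affects newly written blocks, so data must be rewritten afterwards
zfs set recordsize=8k zfs/mongo_data-rum_a
# variant tested without checksums
zfs set checksum=off zfs/mongo_data-rum_a
# then: stop mongod, wipe the dbpath on this dataset, restart and let the
# replica do a full initial sync from the primary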
Nevertheless, none of this solved anything; the disk was always kept 100% busy. Only once, on one server with recordsize=8k, was the disk much less busy than on any non-ZFS replica, but after trying different settings and going back to recordsize=8k, the disk was at 100% again; I could not reproduce the previous good behaviour, and could not see it on any other replica either.
Moreover, there should be almost only writes, yet on all replicas, under all the different settings, the disk is completely busy with 75% reads and only 25% writes.
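(For reference, this kind of read/write split shows up in plain device and pool stats, along the lines of:)
# per-device read/write rates and %util, 5 s interval
iostat -x 5
# same picture from the pool's side
zpool iostat -v zfs 5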
(Note: I believe MongoDB is an mmapped DB. I was told to try MongoDB in AIO mode, but I did not find how to set that, and from another server running MySQL InnoDB I realised that ZFSonLinux did not support AIO anyway.)
My servers run CentOS 6.5, kernel 2.6.32-431.5.1.el6.x86_64, with spl-0.6.2-1.el6.x86_64 and zfs-0.6.2-1.el6.x86_64.
#PROD 13:44:55 root@rum-mongo-backup-1:~: zfs list
NAME                      USED  AVAIL  REFER  MOUNTPOINT
zfs                       216G  1.56T    32K  /zfs
zfs/mongo_data-rum_a     49.5G  1.56T  49.5G  /zfs/mongo_data-rum_a
zfs/mongo_data-rum_old    166G  1.56T   166G  /zfs/mongo_data-rum_old
#PROD 13:45:20 root@rum-mongo-backup-1:~: zfs list -t snapshot
no datasets available
#PROD 13:45:29 root@rum-mongo-backup-1:~: zfs list -o atime,devices,compression,copies,dedup,mountpoint,recordsize,casesensitivity,xattr,checksum
ATIME  DEVICES  COMPRESS  COPIES  DEDUP  MOUNTPOINT               RECSIZE  CASE       XATTR  CHECKSUM
  off       on       lz4       1    off  /zfs                        128K  sensitive     sa       off
  off       on       lz4       1    off  /zfs/mongo_data-rum_a         8K  sensitive     sa       off
  off       on       lz4       1    off  /zfs/mongo_data-rum_old       8K  sensitive     sa       off
What could be going on here? What should I look at to figure out what ZFS is doing, or which setting is badly configured?
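For instance, is the ARC the right place to look first, e.g. via its kstats? (A sketch, assuming a stock ZoL layout; arcstat.py only if this build ships it:)
# ARC target size and hit/miss counters exposed by ZoL
egrep '^(size|c_max|hits|misses)' /proc/spl/kstat/zfs/arcstats
# continuous view, if arcstat.py is installed with this ZoL release
arcstat.py 5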
EDIT1:
Hardware: these are rented servers, 8 vcores on a Xeon 1230 or 1240, 16 or 32 GB RAM, with zfs_arc_max=2147483648, using HP hardware RAID1. So the ZFS zpool is on /dev/sda2 and does not know that there is an underlying RAID1. Even if this is a suboptimal setup for ZFS, I still do not understand why the disk is choking on reads while the DB does only writes.
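(For completeness, zfs_arc_max is set as a module option, i.e. something like the following in /etc/modprobe.d/; the exact file name is just our convention:)
# /etc/modprobe.d/zfs.conf -- cap the ARC at 2 GiB
options zfs zfs_arc_max=2147483648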
I understand the many reasons, which we do not need to rehash here, why this setup is bad for ZFS, and I will soon have a JBOD/no-RAID server on which I can run the same tests with ZFS's own RAID1 implementation on an sda2 partition, with the /, /boot and swap partitions on software RAID1 with mdadm. A rough sketch of that layout follows.
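(Roughly what I have in mind for that box; device names are placeholders, and ashift=12 assumes 4K-sector disks:)
# ZFS mirror directly on the second partition of each disk
zpool create -o ashift=12 zfs mirror /dev/sda2 /dev/sdb2
# /, /boot and swap stay on mdadm RAID1 on the other partitions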