
Recently I created a RAID5 with mdadm:

mdadm --create /dev/md0 -l 5 -n 4 -c 512 /dev/sdb /dev/sdc /dev/sdd /dev/sde

The usual tuning to speed up the initial sync:

echo 32768 > /sys/block/md0/md/stripe_cache_size

Then I left it alone to finish syncing.
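
For completeness, the md resync speed limits can also be raised while the initial sync runs; the values below are only examples, not part of my original setup:

echo 50000 > /proc/sys/dev/raid/speed_limit_min
echo 500000 > /proc/sys/dev/raid/speed_limit_max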

Next, I created and tuned a file system on the device, optimized for a few huge files:

mke2fs -t ext4 -e remount-ro -Elazy_journal_init=0,lazy_itable_init=0,stride=128,stripe_width=384 -i 524288 /dev/md0
tune2fs -r0 -c0 -i12m -o ^acl,journal_data_writeback,nobarrier /dev/md0

I forced the writeout of the ext4 data structures at mkfs time to prevent background initialization from skewing later benchmarks. The defaults from /etc/mke2fs.conf are stock Debian 9, untouched by me.
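
To double-check that the RAID layout hints actually landed in the superblock, something like this should show them (a sketch; the exact label text may differ between e2fsprogs versions):

dumpe2fs -h /dev/md0 | grep -i -e raid -e stripe
# expected output along the lines of:
#   RAID stride:              128
#   RAID stripe width:        384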

Then I mounted this filesystem:

mount -o mand,nodev,stripe=1536,delalloc,auto_da_alloc,noatime,nodiratime /dev/md0 /mnt

Everything's fine, so far.

When I write (big) files to this filesystem, iostat -x 2 shows that one disk is loaded at 100% and the rest mostly idle.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00   73.50  539.00   294.00  2155.25     8.00   146.37  238.01  188.30  244.79   1.63 100.00
sdc               0.00     0.00    4.50  545.00    18.00  2179.25     8.00     2.92    5.31    3.56    5.32   0.08   4.40
sdd               0.00     0.00    2.50  545.50    10.00  2181.25     8.00     2.90    5.30    4.00    5.31   0.09   4.80
sde               0.00     0.00   33.50  514.50   134.00  2057.25     8.00     2.96    5.39    0.12    5.74   0.07   4.00
md0               0.00     0.00    0.00   67.50     0.00 56740.00  1681.19     0.00    0.00    0.00    0.00   0.00   0.00

When I repeat all these steps but omit the journal (adding -O ^has_journal as an additional mke2fs parameter), the disk load is spread evenly across all disks. So it seems the journal does not get spread across the disks.
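
For clarity, the journal-less variant was simply the same mke2fs invocation with the extra feature flag; a sketch (lazy_journal_init dropped since there is no journal to initialize):

mke2fs -t ext4 -O ^has_journal -e remount-ro -Elazy_itable_init=0,stride=128,stripe_width=384 -i 524288 /dev/md0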

How can I benefit from a journal while retaining the ability to get more speed by loading all disks (more or less) evenly? Is this even possible while forcing all data through the journal with journal_data_writeback?

I thought about externalizing the journal, but where should I place it? A RAM disk is volatile, so that's no good. Years ago, true DRAM-based solid-state disks with battery backup were available, but these seem to have been entirely replaced by flash-based SSD media. DRAM would have no endurance drawbacks under a mostly write-oriented load.

Addendum: The on-disk journal is 1024M, according to this article. So considering size alone, it clearly shouldn't be a locality problem.
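
The on-disk journal size can also be read straight from the superblock, for anyone who wants to verify it; a sketch (the field is printed by reasonably recent e2fsprogs):

dumpe2fs -h /dev/md0 | grep -i journal
# e.g.  Journal size:             1024M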

    Check the smart counters for that disk. I bet it's bad or failing. – longneck Oct 13 '18 at 13:32
  • The disks are all rather new and the SMART counters (retries, etc.) are alike. `smartctl -H` reports *passed*. `dmesg` is clean, no errors logged. The AHCI controller had an issue with my earlier TRIM experiments, though: accesses were painfully slow. A reboot solved that. The uneven load remained until I took the step above and omitted the journal entirely. – PoC Oct 14 '18 at 14:25
  • That's really weird, because in RAID5, you should not be able to cause a locality utilization problem like that; all writes should involve a full stripe, which is by definition all disks. As I'm typing this, drive caches and NCQ come to mind. Can you check that all drives are on the same firmware, and that they all have the same on board cache size, and the on board write caches are all either enabled or disabled? – longneck Oct 15 '18 at 13:41
  • Is that one drive on a different controller? – longneck Oct 15 '18 at 13:42
  • It is clearly neither a hardware nor a firmware issue. See above: omitting the journal solves the issue. (Additionally: not forcing data writes through the journal does, too.) No different controller, same disks from the same batch. I can't prove the same firmware version, though. A locality problem could arise if the journal area being read from and written to is too small to be striped, at least if the code permits partial reads/writes. That seems logical, at least. Can you confirm this assumption? I also had this issue a year ago, still on Debian 8, on another machine with a similar setup. – PoC Oct 16 '18 at 08:21

1 Answer


From the stripe parameters you gave to mke2fs and mdadm, it appears that the chunk size you specified is 512k. The problem you are seeing is that while the journal is spread out across all of the disks (it's going to be somewhere between 128MB and 1024MB, depending on your file system size), the amount of data that needs to be written to the journal at each commit is not very large. It's typically only a handful of blocks; maybe a few dozen, tops, for a sequential write workload. The problem is that those writes have to be synchronously written to disk at each commit, which by default happens every five seconds (which means that after a crash, you will lose at most 5 seconds' worth of metadata updates). Let's assume the average transaction size is 8 blocks. A 512k chunk holds 128 4k blocks, so it's going to take 16 commits, or 80 seconds, before the synchronous journal commits move on to the next disk, and then that disk will be getting all of the synchronous updates.
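
If you want to experiment with that cadence, the commit interval is an ordinary ext4 mount option; a sketch with an example value (a longer interval also widens the window of metadata lost after a crash):

mount -o remount,commit=30 /mnt
# or equivalently at mount time: mount -o commit=30,... /dev/md0 /mnt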

There's something else going on, though. The average request size for all of your disks (sdb..sde) is 8 sectors, or 4k. The average request size going into the md0 device is about 840k, which is respectable but not huge. For some reason these writes are getting broken up into roughly 500 tiny 4k writes before they are sent to your disks. That's the biggest problem, and using a large chunk size is probably hurting rather than helping.
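
To see how the block layer is treating requests on each member disk, the standard sysfs queue attributes are worth a look; a sketch assuming your device names:

for d in sdb sdc sdd sde; do
    echo "== $d =="
    cat /sys/block/$d/queue/scheduler       # active I/O scheduler
    cat /sys/block/$d/queue/nomerges        # 0 = request merging enabled
    cat /sys/block/$d/queue/max_sectors_kb  # largest request the kernel will issue
done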

What kind of disks are you using, and how are they connected to your system? Fixing this is going to be the biggest thing you can do to help.

As far as where to put your external journal, the general suggestion is to use a small SSD connected to your system.
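
For reference, an external journal is created as its own journal device and referenced at mkfs time; a sketch with a hypothetical SSD partition /dev/sdf1 (the journal device's block size must match the main filesystem's):

# format the SSD partition as an ext4 journal device (4k blocks, matching md0)
mke2fs -O journal_dev -b 4096 /dev/sdf1
# create the main filesystem pointing at the external journal
mke2fs -t ext4 -b 4096 -J device=/dev/sdf1 /dev/md0
# an existing filesystem can be switched later:
#   tune2fs -O ^has_journal /dev/md0
#   tune2fs -j -J device=/dev/sdf1 /dev/md0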

  • I guess there is another caveat here as each of those writes will also need to update the parity disk. – kasperd Oct 16 '18 at 16:19
  • @Theodore Ts'o: Thank you for your valuable input. Interestingly, I couldn't observe a change in which disk gets the load, even after half an hour. The disks are Samsung 860 SSDs, connected via an onboard AHCI controller. As to why the I/O is split into tiny requests: I set the I/O scheduler to noop and disabled I/O merging completely. This reduces I/O latency and a bit of CPU usage. The SSDs' controllers are fast enough for that kind of load, as I tested a few months ago. – PoC Oct 18 '18 at 11:53
  • For your recommendation about the small SSD: That would be another SPoF. So I need to mirror this one. Because the write load is huge, I was about to opt for a DRAM-based solution. Unfortunately, the system is in production use now, so I can't test stuff anymore. – PoC Oct 18 '18 at 11:53