Recently I created a RAID5 with mdadm:
mdadm --create /dev/md0 -l 5 -n 4 -c 512 /dev/sdb /dev/sdc /dev/sdd /dev/sde
The usual tuning to speed up the initial sync:
echo 32768 > /sys/block/md0/md/stripe_cache_size
Then I left it alone to finish syncing.
Next, I created and tuned a filesystem on the device, optimized for a few huge files:
mke2fs -t ext4 -e remount-ro -Elazy_journal_init=0,lazy_itable_init=0,stride=128,stripe_width=384 -i 524288 /dev/md0
tune2fs -r0 -c0 -i12m -o ^acl,journal_data_writeback,nobarrier /dev/md0
I forced the ext4 data structures to be written out at mkfs time so that background initialization would not skew the benchmarks. The defaults in /etc/mke2fs.conf are the stock Debian 9 ones; I have not changed them.
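For reference, this is how I arrived at the stride and stripe_width values in the mke2fs call. It is only a small arithmetic sketch; the 4 KiB block size is an assumption (it is what mke2fs picks by default for a filesystem of this size), the other numbers come from the mdadm command above.

```shell
# Values taken from the mdadm call above; block_kib is an assumption.
chunk_kib=512                 # mdadm -c 512
block_kib=4                   # assumed ext4 block size (4 KiB)
disks=4                       # mdadm -n 4
data_disks=$((disks - 1))     # RAID5: one chunk per stripe holds parity

# stride: filesystem blocks per RAID chunk
stride=$((chunk_kib / block_kib))
# stripe_width: filesystem blocks per full data stripe
stripe_width=$((stride * data_disks))

echo "stride=$stride stripe_width=$stripe_width"
```

This prints stride=128 stripe_width=384, matching the -E options I passed.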
Then I mounted this filesystem:
mount -o mand,nodev,stripe=1536,delalloc,auto_da_alloc,noatime,nodiratime /dev/md0 /mnt
Everything's fine, so far.
When I write (big) files to this filesystem, iostat -x 2 shows that one disk is loaded at 100% while the rest are mostly idle.
Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
sdb        0.00    0.00  73.50  539.00  294.00   2155.25      8.00    146.37  238.01   188.30   244.79   1.63  100.00
sdc        0.00    0.00   4.50  545.00   18.00   2179.25      8.00      2.92    5.31     3.56     5.32   0.08    4.40
sdd        0.00    0.00   2.50  545.50   10.00   2181.25      8.00      2.90    5.30     4.00     5.31   0.09    4.80
sde        0.00    0.00  33.50  514.50  134.00   2057.25      8.00      2.96    5.39     0.12     5.74   0.07    4.00
md0        0.00    0.00   0.00   67.50    0.00  56740.00   1681.19      0.00    0.00     0.00     0.00   0.00    0.00
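As a quick sanity check on these numbers, I divided wkB/s by w/s to get the average write size per device. This is just a sketch over the sample pasted above (the variable names are mine), not a general iostat parser.

```shell
# Device lines copied from the iostat sample above (header left out).
samples='sdb 0.00 0.00 73.50 539.00 294.00 2155.25 8.00 146.37 238.01 188.30 244.79 1.63 100.00
sdc 0.00 0.00 4.50 545.00 18.00 2179.25 8.00 2.92 5.31 3.56 5.32 0.08 4.40
sdd 0.00 0.00 2.50 545.50 10.00 2181.25 8.00 2.90 5.30 4.00 5.31 0.09 4.80
sde 0.00 0.00 33.50 514.50 134.00 2057.25 8.00 2.96 5.39 0.12 5.74 0.07 4.00
md0 0.00 0.00 0.00 67.50 0.00 56740.00 1681.19 0.00 0.00 0.00 0.00 0.00 0.00'

# Field 5 is w/s, field 7 is wkB/s; skip devices with no writes.
avg_write_kib=$(printf '%s\n' "$samples" |
    awk '$5 > 0 { printf "%s %.2f\n", $1, $7 / $5 }')

printf '%s\n' "$avg_write_kib"
```

Every member disk averages about 4 KiB per write (consistent with avgrq-sz of 8 sectors), while md0 itself sees about 840 KiB per write. All four members receive nearly the same number of writes per second; only sdb shows the long queue and high await.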
When I repeat all these steps but omit the journal (mke2fs -O^has_journal as an additional parameter), disk load is spread evenly across all disks. So it seems that journal writes do not get spread across the disks.
How can I benefit from a journal while still loading all disks (more or less) evenly to get more speed? Is this even possible with journal_data_writeback (which, as far as I understand, journals only metadata and not file data)?
I thought about externalizing the journal, but where should I place it? A RAM disk is volatile, so that is not an option. Years ago there were true DRAM-based solid state disks with battery backup, but these seem to have all been replaced by flash-based SSDs. DRAM has no drawbacks under a mostly write-oriented load.
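If I went the external route, I assume the steps would look roughly like the sketch below. I have not tried it: /dev/sdf1 is a placeholder for whatever device would hold the journal, the 4096-byte block size is my assumption about the existing filesystem, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
# Placeholder devices: JDEV is hypothetical, MD is the array from above.
JDEV=/dev/sdf1
MD=/dev/md0

# Print commands by default; set DRY_RUN=0 to really run them (destructive).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

external_journal_plan() {
    # Format the separate device as a journal device; its block size
    # has to match the filesystem (assumed 4096 here).
    run mke2fs -O journal_dev -b 4096 "$JDEV"
    # With the filesystem unmounted: drop the internal journal ...
    run tune2fs -O '^has_journal' "$MD"
    # ... and attach the external one.
    run tune2fs -J device="$JDEV" "$MD"
}

external_journal_plan
```

That still leaves the question of which kind of device is suitable for JDEV.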
Addendum: The journal on disk is 1024M, according to this article. So it clearly shouldn't be a locality problem considering only size.