3

I have three 1 TB HDDs and three 500 GB HDDs. Right now each size grouping is in its own RAID 5 array, and both arrays are in a single LVM volume group (with striped LVs).

I'm finding this to be too slow for my usage on small random writes. I've fiddled with stripe sizes both at the RAID level and at the LVM stripe level, and I've increased the stripe cache and the readahead buffer size. I've also disabled NCQ, as per the usual advice.
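For reference, the tweaks were along these lines (a rough sketch from memory; device names and values here are just examples):

# bigger stripe cache on the RAID 5 arrays
echo 8192 > /sys/block/md11/md/stripe_cache_size
echo 8192 > /sys/block/md12/md/stripe_cache_size

# larger readahead on the md devices
blockdev --setra 4096 /dev/md11
blockdev --setra 4096 /dev/md12

# disable NCQ on each member disk
echo 1 > /sys/block/sdc/device/queue_depth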

So I am done with Linux software RAID 5. Without a dedicated controller, it's not useful for my purposes.

I am adding another 1 TB drive and another 500 GB drive, so I'll have four of each.

How would you configure the eight drives to get the best small random write performance? Excluding simple RAID 0, of course, since the point of this setup is obviously also redundancy. I have considered putting the four 500 GB disks into two RAID 0 pairs and then adding those to a RAID 10 of the four 1 TB drives, for a six-member RAID 10, but I am not sure that this is the best solution. What say you?
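To make that nested idea concrete, it would be something like this (purely a sketch; device names are placeholders):

# two RAID 0 pairs of 500 GB drives, each acting as a 1 TB member
mdadm --create /dev/md30 --level=0 --raid-devices=2 /dev/sde1 /dev/sdf1
mdadm --create /dev/md31 --level=0 --raid-devices=2 /dev/sdg1 /dev/sdh1

# then a six-member RAID 10 of the four 1 TB drives plus the two pairs
mdadm --create /dev/md32 --level=10 --raid-devices=6 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/md30 /dev/md31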

Edit: There is no more budget for hardware upgrades. What I am really asking is this: since the four 1 TB drives can go into a RAID 10 pretty straightforwardly, what do I do with the four 500 GB drives so that they fit in best alongside the 4 × 1 TB RAID 10 without becoming a redundancy or performance problem? The other idea I had was to put all four 500 GB drives into their own RAID 10 and then use LVM to add that capacity in with the 4 × 1 TB RAID 10. Is there anything better you can think of?
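For concreteness, the straightforward half would look something like this (a sketch only; device names are placeholders):

# four 1 TB drives into a RAID 10
mdadm --create /dev/md20 --level=10 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdi1 /dev/sdj1

# hand the new array to LVM, alongside whatever the 500 GB drives end up as
pvcreate /dev/md20
vgextend array /dev/md20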

Another Edit: The existing array is formatted as follows:

A 1 TB ext4-formatted striped LVM volume used as a file share. Shared to two Macs via AFP.
A 500 GB LVM logical volume exported via iSCSI to a Mac, formatted as HFS+. Used as a Time Machine backup.
A 260 GB LVM logical volume exported via iSCSI to a Mac, formatted as HFS+. Used as a Time Machine backup.
A 200 GB ext4-formatted LVM volume, used as a disk device for a virtualised OS installation.
An LVM snapshot of the 500 GB Time Machine backup.

One thing that I haven't tried is replacing the Time Machine LVs with files on the ext4 filesystem (so that the iSCSI target points at a file instead of a block device). I have a feeling that would solve my speed issues, but it would prevent me from taking snapshots of those volumes, so I am not sure it's worth the trade-off.
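If I did go the file-backed route, I imagine it would look roughly like this (assuming iSCSI Enterprise Target; the file name and target name are made up):

# create a 500 GB sparse backing file on the ext4 share
dd if=/dev/zero of=/mnt/array/data/etm.img bs=1M count=0 seek=512000

# then point the target at the file instead of the LV, e.g. in ietd.conf:
#   Target iqn.2010-11.local.server:etm
#       Lun 0 Path=/mnt/array/data/etm.img,Type=fileio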

In the future I intend to move an iPhoto and an iTunes library onto the server on another HFS+ iSCSI mount; testing that is how I first noticed the dismal random write performance.

If you're curious, I used the info in the RAID Math section of http://wiki.centos.org/HowTos/Disk_Optimization to figure out how to set everything up for the ext4 partition (and as a result I'm seeing excellent performance on it), but this doesn't seem to have done any good for the iSCSI-shared HFS+ volumes.
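For anyone following along, the RAID math there boils down to roughly this (a generic sketch; the exact numbers depend on the chunk size, the number of data disks, and any LVM striping on top):

# stride       = RAID chunk size / filesystem block size = 256 KiB / 4 KiB = 64
# stripe-width = stride * data disks (2 of the 3 in a 3-disk RAID 5) = 128
mkfs.ext4 -b 4096 -E stride=64,stripe-width=128 /dev/array/data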

A lot more detail:

 output of lvdisplay:

  --- Logical volume ---
  LV Name                /dev/array/data
  VG Name                array
  LV UUID                2Lgn1O-q1eA-E1dj-1Nfn-JS2q-lqRR-uEqzom
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                1.00 TiB
  Current LE             262144
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     2048
  Block device           251:0

  --- Logical volume ---
  LV Name                /dev/array/etm
  VG Name                array
  LV UUID                KSwnPb-B38S-Lu2h-sRTS-MG3T-miU2-LfCBU2
  LV Write Access        read/write
  LV snapshot status     source of
                         /dev/array/etm-snapshot [active]
  LV Status              available
  # open                 1
  LV Size                500.00 GiB
  Current LE             128000
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     2048
  Block device           251:1

  --- Logical volume ---
  LV Name                /dev/array/jtm
  VG Name                array
  LV UUID                wZAK5S-CseH-FtBo-5Fuf-J3le-fVed-WzjpOo
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                260.00 GiB
  Current LE             66560
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     2048
  Block device           251:2

  --- Logical volume ---
  LV Name                /dev/array/mappingvm
  VG Name                array
  LV UUID                69k2D7-XivP-Zf4o-3SVg-QAbD-jP9W-cG8foD
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                200.00 GiB
  Current LE             51200
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     2048
  Block device           251:3

  --- Logical volume ---
  LV Name                /dev/array/etm-snapshot
  VG Name                array
  LV UUID                92x9Eo-yFTY-90ib-M0gA-icFP-5kC6-gd25zW
  LV Write Access        read/write
  LV snapshot status     active destination for /dev/array/etm
  LV Status              available
  # open                 0
  LV Size                500.00 GiB
  Current LE             128000
  COW-table size         500.00 GiB
  COW-table LE           128000
  Allocated to snapshot  44.89% 
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     2048
  Block device           251:7


output of pvs --align -o pv_name,pe_start,stripe_size,stripes

PV         1st PE  Stripe  #Str
  /dev/md0   192.00k      0     1
  /dev/md0   192.00k      0     1
  /dev/md0   192.00k      0     1
  /dev/md0   192.00k      0     1
  /dev/md0   192.00k      0     0
  /dev/md11  512.00k 256.00k    2
  /dev/md11  512.00k 256.00k    2
  /dev/md11  512.00k 256.00k    2
  /dev/md11  512.00k      0     1
  /dev/md11  512.00k      0     1
  /dev/md11  512.00k      0     0
  /dev/md12  512.00k 256.00k    2
  /dev/md12  512.00k 256.00k    2
  /dev/md12  512.00k 256.00k    2
  /dev/md12  512.00k      0     0

output of cat /proc/mdstat

md12 : active raid5 sdc1[1] sde1[0] sdh1[2]
      976770560 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]

md11 : active raid5 sdg1[2] sdf1[0] sdd1[1]
      1953521152 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]



output of  vgdisplay:


--- Volume group ---
  VG Name               array
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  8
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                5
  Open LV               3
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               2.73 TiB
  PE Size               4.00 MiB
  Total PE              715402
  Alloc PE / Size       635904 / 2.43 TiB
  Free  PE / Size       79498 / 310.54 GiB
  VG UUID               PGE6Oz-jh96-B0Qc-zN9e-LKKX-TK6y-6olGJl



output of dumpe2fs /dev/array/data | head -n 100 (or so)

dumpe2fs 1.41.12 (17-May-2010)
Filesystem volume name:   <none>
Last mounted on:          /mnt/array/data
Filesystem UUID:          b03e8fbb-19e5-479e-a62a-0dca0d1ba567
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              67108864
Block count:              268435456
Reserved block count:     13421772
Free blocks:              113399226
Free inodes:              67046222
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      960
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              128
RAID stripe width:        128
Flex block group size:    16
Filesystem created:       Thu Jul 29 22:51:26 2010
Last mount time:          Sun Oct 31 14:26:40 2010
Last write time:          Sun Oct 31 14:26:40 2010
Mount count:              1
Maximum mount count:      22
Last checked:             Sun Oct 31 14:10:06 2010
Check interval:           15552000 (6 months)
Next check after:         Fri Apr 29 14:10:06 2011
Lifetime writes:          677 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      9e6a9db2-c179-495a-bd1a-49dfb57e4020
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x000059af
Journal start:            1




output of lvs array --aligned -o seg_all,lv_all

  Type    #Str Stripe  Stripe  Region Region Chunk Chunk Start Start SSize   Seg Tags PE Ranges                                       Devices                             LV UUID                                LV           Attr   Maj Min Rahead KMaj KMin KRahead LSize   #Seg Origin OSize   Snap%  Copy%  Move Convert LV Tags Log Modules 
  striped    2 256.00k 256.00k     0      0     0     0     0      0   1.00t          /dev/md11:0-131071 /dev/md12:0-131071           /dev/md11(0),/dev/md12(0)           2Lgn1O-q1eA-E1dj-1Nfn-JS2q-lqRR-uEqzom data         -wi-ao  -1  -1   auto 251  0      1.00m   1.00t    1             0                                                 
  striped    2 256.00k 256.00k     0      0     0     0     0      0 500.00g          /dev/md11:131072-195071 /dev/md12:131072-195071 /dev/md11(131072),/dev/md12(131072) KSwnPb-B38S-Lu2h-sRTS-MG3T-miU2-LfCBU2 etm          owi-ao  -1  -1   auto 251  1      1.00m 500.00g    1        500.00g                                        snapshot
  linear     1      0       0      0      0  4.00k 4.00k    0      0 500.00g          /dev/md11:279552-407551                         /dev/md11(279552)                   92x9Eo-yFTY-90ib-M0gA-icFP-5kC6-gd25zW etm-snapshot swi-a-  -1  -1   auto 251  7      1.00m 500.00g    1 etm    500.00g  44.89                                 snapshot
  striped    2 256.00k 256.00k     0      0     0     0     0      0 260.00g          /dev/md11:195072-228351 /dev/md12:195072-228351 /dev/md11(195072),/dev/md12(195072) wZAK5S-CseH-FtBo-5Fuf-J3le-fVed-WzjpOo jtm          -wi-ao  -1  -1   auto 251  2      1.00m 260.00g    1             0                                                 
  linear     1      0       0      0      0     0     0     0      0 200.00g          /dev/md11:228352-279551                         /dev/md11(228352)                   69k2D7-XivP-Zf4o-3SVg-QAbD-jP9W-cG8foD mappingvm    -wi-a-  -1  -1   auto 251  3      1.00m 200.00g    1             0                                                 




cat /sys/block/md11/queue/logical_block_size 
512
cat /sys/block/md11/queue/physical_block_size 
512
cat /sys/block/md11/queue/optimal_io_size 
524288
cat /sys/block/md11/queue/minimum_io_size 
262144

cat /sys/block/md12/queue/minimum_io_size 
262144
cat /sys/block/md12/queue/optimal_io_size 
524288
cat /sys/block/md12/queue/logical_block_size 
512
cat /sys/block/md12/queue/physical_block_size 
512

Edit: So no one can tell me whether or not there is something wrong here? No concrete advice at all? Hmmm.

RibaldEddie
  • And you did not mention if you have tried different I/O elevators. ;-) – Janne Pikkarainen Nov 04 '10 at 17:33
  • I'm running a bunch of tests now. – RibaldEddie Nov 04 '10 at 18:27
  • For anyone who finds this in the future, here's some general Linux parity RAID (level 5 and 6) tuning advice I didn't see in the answers or comments: (1) do not disable NCQ unless you have drives whose NCQ behavior hurts software RAID performance, and doing so anyway will often hurt performance; (2) increase the stripe cache size to reduce the RAID parity read-on-write penalty: `echo 8192 > /sys/block/md2/md/stripe_cache_size`. This allows the kernel to queue up more parity updates in a row, which could help performance with many small files. – Jody Bruchon Nov 11 '15 at 03:33

4 Answers

5

Sorry to say, but RAID 5 is ALWAYS bad for small writes unless the controller has plenty of cache. There are a lot of reads and writes for the parity: a single small random write typically turns into two reads and two writes (old data, old parity, new data, new parity).

Your best bet is RAID 10 on a hardware controller. For REAL screaming performance, get something like an Adaptec and make HALF the drives SSDs; that way all reads can go to the SSDs, which will give you tons of performance there, though writes obviously have to hit both halves. Not sure Linux software RAID can do the same.

The rest depends totally on your usage pattern, and basically you did not tell us anything about that.

TomTom
  • Yes, I did tell you: small random writes. Hence my disappointment with the RAID 5 performance. I understood the trade-offs when I initially set up the RAID 5 but didn't realise how much more prevalent small random writes would be as a percentage of the total IOPS. – RibaldEddie Nov 03 '10 at 18:27
  • I know - but that totally kills RAID 5. – TomTom Nov 03 '10 at 18:36
  • Hence the reason I want to stop using RAID 5 and start using RAID 10. – RibaldEddie Nov 03 '10 at 18:38
  • This isn't necessarily true if your storage and application are tuned correctly. Parity is why writes take so long, but most of the parity calculation isn't spent in processing, it's spent in reading the part of the stripe that's not being written. If your writes are a consistent size, you can size your stripe to eliminate partial writes, which substantially speeds up your writes because you don't need to read the remainder of the stripe to recalculate parity for that stripe. This necessitates not only sizing your segments properly, but your number of spindles in the array as well. – jgoldschrafe Nov 03 '10 at 18:53
  • @jgoldschrafe, yes I am aware of that. When I set up the RAID 5 arrays I performed some calculations that gave me a stripe size of 256k. I haven't been happy with the performance. – RibaldEddie Nov 03 '10 at 19:29
  • @jgoldschrafe Also for the record I am getting great speeds from my ext4 partition precisely because I spent the time to learn how to set everything up properly. I can write around 150MB/s and read about 300MB/s from my RAID 5 array on the ext4 partition. The slowness comes from an HFS+ partition that is exported to a Mac via iSCSI. I would need a separate RAID 5 specifically for the iSCSI mount. – RibaldEddie Nov 03 '10 at 20:24
  • Let me know if you think my edits make it clearer. – RibaldEddie Nov 03 '10 at 22:42
2

Option A.) Do you need the space? You could "short stroke" the 1 TB drives to 500 GB and run an 8-disk RAID 10 array (for 2 TB of usable space). Since you haven't mentioned otherwise, I'm going to assume they're all 7200 rpm spindles, so you're looking at roughly 400 random writes per second.

That's your best performance option; anything else would require better hardware or RAID 0.

Option B.) One 4-disk RAID 10 array of the 1 TB drives, another 4-disk RAID 10 array of the 500 GB drives, with simple LVM spanning. That gives you 200 random write IOPS on one array and 200 on the other.

Option C.) One 8-disk RAID 10 array made from the first 500 GB of every drive, then a 4-disk RAID 10 array made from the "back" 500 GB of the 1 TB drives, LVM-spanned. That'll give a peak of 400 random write IOPS when you're on the 8-disk section of the VG.
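If it helps, option C would look roughly like this (a sketch only; device names and partition layout are placeholders):

# each 1 TB drive split into two ~500 GB partitions (sd[abcd]1 and sd[abcd]2),
# each 500 GB drive as a single partition (sd[efgh]1)
mdadm --create /dev/md20 --level=10 --raid-devices=8 /dev/sd[abcdefgh]1
mdadm --create /dev/md21 --level=10 --raid-devices=4 /dev/sd[abcd]2
pvcreate /dev/md20 /dev/md21
vgcreate vg_data /dev/md20 /dev/md21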

You didn't really tell us anything about the application. If it's one sequential log write, you're best off with C. If it's broken up into at least two parallel writing threads, I'd prefer the simplicity of B (and wouldn't LVM them together).

cagenut
  • Yes I need the space. – RibaldEddie Nov 03 '10 at 19:31
  • Okay, see my edits. See if that helps you. – RibaldEddie Nov 03 '10 at 22:42
  • I'm being pedantic, but consumer SATA disks are typically rated between 70 and 80 IOPS, with enterprise disks topping out around 90, so your IOPS numbers are off by about 25%. Also, what are you recommending spanning? I'd definitely not take the spanning approach with anything performance-critical; LVM is non-deterministic in the first place and can make it really, extraordinarily difficult to get a proper partition alignment in the second. – jgoldschrafe Nov 04 '10 at 01:55
  • @jgoldschrafe the whole lvm being hard to align isn't really true: http://people.redhat.com/msnitzer/docs/io-limits.txt – RibaldEddie Nov 04 '10 at 03:32
  • Enterprise disks top out at a LOT more than 90 IOPS, although not the 7200 rpm variant, just to make that clear. – TomTom Nov 04 '10 at 06:07
1

In addition to configuring RAID and LVM, did you try a different disk I/O elevator? CFQ seems to be the default for many distributions nowadays, and for certain workloads it's fine. But it has bitten me badly a couple of times: for example, one backup server backing up around 20 hosts, totalling around 30 million files and a couple of terabytes, was surprisingly slow and I/O took a lot of time.

After I switched to the deadline scheduler, all the operations on that server became about twice as fast. OK, in my case the filesystem was (and still is...) XFS, and in the past the XFS+CFQ combo has had its gotchas, but it's worth a try anyway.

If you want to change the I/O elevator on the fly:

echo deadline >/sys/block/yourdisk/queue/scheduler

If you want to make that change permanent, add the parameter elevator=deadline to the kernel line in your grub.conf, or whatever boot loader you use.
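For example, the kernel line in grub.conf might end up looking something like this (the kernel version and root device here are just placeholders):

kernel /vmlinuz-2.6.32-xx ro root=/dev/mapper/vg-root elevator=deadline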

You can also try the anticipatory and noop schedulers.

Janne Pikkarainen
1

RAID 5 is intrinsically bad for small writes, because it has to read back the existing data and parity blocks before it can write each stripe update. Hardware controllers get around that with a battery-backed cache, which avoids having to wait for the disks to seek. Such a cache will help all small writes, not just on RAID 5, but it is especially useful there.

There could be a solution, though: try switching your filesystem to journal its data:

tune2fs -o journal_data /dev/md0

(That's for ext3 obviously)

You might also want to increase the size of the journal. You can go even faster by putting the journal on another device. Typically, if you have a RAID 1 for your system and a big RAID 5 for data, reserve a volume on the first; committing the journal will be much faster that way, since it will require half as many seeks. (See man tune2fs for more info on how to do this.)
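A minimal sketch of what moving the journal to a separate device could look like (device names are placeholders, and the filesystem must be unmounted and clean first):

# drop the existing internal journal
tune2fs -O ^has_journal /dev/array/data

# format a small dedicated volume as an external journal
# (it must use the same block size as the filesystem)
mke2fs -b 4096 -O journal_dev /dev/system/journal

# attach it as the filesystem's journal
tune2fs -j -J device=/dev/system/journal /dev/array/data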

Important note: I haven't tested this. It should work, but it's also possible that it won't give as much of an advantage as it theoretically could.

niXar