0

I can't understand why the write speed of 10 SAS HDDs built into RAID5 is so slow. Read cache: Enabled. Write cache: Disabled (the server has no battery). Stripe size: 512k

-----------------------------------------------------------------------
CrystalDiskMark 3.0.1 x64 (C) 2007-2010 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :   371.345 MB/s
          Sequential Write :    22.066 MB/s
         Random Read 512KB :  1710.567 MB/s
        Random Write 512KB :    18.550 MB/s
    Random Read 4KB (QD=1) :    78.245 MB/s [ 19102.9 IOPS]
   Random Write 4KB (QD=1) :     0.654 MB/s [   159.6 IOPS]
   Random Read 4KB (QD=32) :   538.820 MB/s [131547.9 IOPS]
  Random Write 4KB (QD=32) :     2.214 MB/s [   540.5 IOPS]

  Test : 50 MB [E: 0.0% (0.4/16740.0 GB)] (x2)
  Date : 2016/09/28 12:35:44
    OS : Windows NT 6.2 Server Standard Edition (full installation) [6.2 Build 9200] (x64)

--------

Testing with the same OS in a virtual machine on ESXi 6.0 U2 gives the same result. Controller: Logical SAS (default settings when creating the VM).

Hardware RAID controller: Adaptec 8405.

Why is it so slow? Thanks for any suggestions.

A_buddy
  • 35
  • 2
  • 11
  • I wonder what the reads without the cache are like. Can you do an hdparm -Tt? Usually, writes on RAID5 are around 1/4 of reads because of the parity calculation. – mzhaase Sep 28 '16 at 14:38
  • Read cache is enabled. Write cache was disabled because there is no battery. – A_buddy Sep 28 '16 at 18:55
  • 1
    Oh god, please tell me those aren't >=1 TB disks? RAID 5 has been dead/dangerous for about 7 years or more – Chopper3 Sep 29 '16 at 09:42
  • We have two servers and two RAID5 arrays connected via Adaptec 8405 controllers. The first is RAID5 on 10 x 2 TB Hitachi SAS HDDs, the second on 11 x 2 TB HGST SAS HDDs. The HGST array works well, while 3 of the Hitachi HDDs were returned to the factory because of issues. – A_buddy Sep 29 '16 at 10:12
  • I'm being very serious now - RAID 5 is dangerous - we can bore you with the maths behind this, but essentially with modern disks of 1 TB or more, when you have to replace a disk you can be certain of introducing at least one unrecoverable read error into your data. This is very well documented (http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/) and we as professional sysadmins here go out of our way to strongly urge people not to use R5 anywhere - R6/60 and R1/10 are fine, but R5/50 is going to kill your data sooner or later - we have a lot of posters on here asking for recovery help with R5 – Chopper3 Sep 29 '16 at 16:22
  • @Chopper3 This also applies to RAID-1 and RAID-6 in almost the same way. One countermeasure is to always use hot spares and buy drives from different batches. The other is to not use plain RAID at all, but specialized filesystems that handle this much better. – hurikhan77 Sep 29 '16 at 17:18
  • Well, R6 at least has a second parity block, so there'll still be redundancy left for the rebuild. – Chopper3 Sep 29 '16 at 17:20
  • Please remember that enterprise (SAS) disks have much better URE ratings... – neutrinus Oct 07 '16 at 10:29

6 Answers

3

You should absolutely not use RAID-5 without a battery, no matter whether you use write caching or not. Any RAID is subject to the write hole without battery-backed buffering. Plus, a battery-backed buffer will increase write performance a lot.

But if you totally insist on using no battery buffering, try lowering the stripe size. 512k seems huge if you mostly do random, small IO. On a 3-HDD array with this stripe size, you need to write contiguous blocks of 1 megabyte to saturate the IO path. Doing smaller IO results in write amplification due to the read-modify-write cycle. That means your array reads 1 MB of net data, modifies 4k, and needs to write 1 MB again. Adding seek overhead explains why even the 512k performance is so low (the amplification factor is 2, and rewriting data needs to wait almost one revolution of the platters, adding 8 ms of IO wait per 1 MB of data written). Effectively, you can only transfer 512k per 16 ms this way, which is about what you get: less than 32 MB/s (if your disks have an 8 ms access time). I'd even suggest that the average access time is your biggest problem here. Get a battery-buffered cache, there's no way around it. And use SSD CacheCade to reduce seek overheads.
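
As a rough back-of-the-envelope check (the 8 ms access time and 16 ms per-write service time are the assumed figures from above, not controller measurements), the expected ceiling works out like this:

    # Rough estimate only: assume each 512 KB random write costs a full
    # read-modify-write plus roughly one extra platter revolution,
    # i.e. one write is serviced about every 16 ms.
    stripe_kb=512
    service_ms=16
    echo "($stripe_kb / 1024) / ($service_ms / 1000)" | bc -l
    # ~31.25 MB/s ceiling, the same ballpark as the 18-22 MB/s measured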

You may get around it by using a UPS and write-back caching, but without a BBU on the controller, the cache may still be subject to application cache flushes and write barriers, which still results in poor performance at times.

If you cannot predict the write patterns of your application, I'd turn the stripe size down a lot, especially if you're going to use a lot of spindles.

hurikhan77
  • 477
  • 3
  • 9
  • 23
  • Use low settings for mail servers, medium settings for database servers, and high settings for file servers. In Linux, you can additionally increase nr_requests to better saturate the IO queue inside the controller, and use the deadline scheduler to give it sorted IO while maintaining good latency. – hurikhan77 Sep 29 '16 at 16:34
  • Bigger stripes will make your benchmarks worse (at least for small IO). You need to learn to interpret the benchmarks for the IO patterns you are going to use and how this correlates with settings of the RAID controller. – hurikhan77 Sep 29 '16 at 16:36
  • This server is for video cameras. Each HD camera writes about 6-7 Mb/s. That's why I'm not sure about a low stripe size. – A_buddy Sep 29 '16 at 16:37
  • Depends on the buffering of the streamer and the OS. If you use Linux, I suggest using XFS for this, as it automatically optimizes for RAID stripes and streaming without too much seek overhead, encouraging the controller to write diagonally. Think of block size as (spindles-1) x stripe_size. So I suggest using 128k stripes as a start for real-data benchmarks. – hurikhan77 Sep 29 '16 at 16:46
  • Ah, and never use 4-HDD RAID-5 because of bad alignment. Use 3, 5, or 9 disks. – hurikhan77 Sep 29 '16 at 16:48
  • I need to use NTFS in a Windows 7 virtual machine on ESXi 6.0 U2 (a requirement of the video software). Are 11 disks also OK alignment-wise? – A_buddy Sep 29 '16 at 16:48
  • ESXi uses a very well-tuned IO scheduler (something that native Windows doesn't have, afaik), so that already gives you benefits. NTFS should work well enough for streaming writes. If there are any tuning parameters in the camera software, try to match them up with your stripe size (multiplied by spindles-1). – hurikhan77 Sep 29 '16 at 16:51
  • I think the RAID-5 alignment formula was something like (2^n)+1 disks... That makes 2, 4, 8, 16 plus 1 each. I'm not sure about it, tho. But it's not that bad of a problem if you stay with odd numbers. 11 should be okay. – hurikhan77 Sep 29 '16 at 16:58
  • Keep in mind that you only avoid the read-modify-write overhead if you write blocks of full stripe size (1 MB in your benchmark case). Your video streamer probably writes smaller blocks. Adding more spindles later only gets you back into this situation if you start with stripes that are too big and don't use caching. – hurikhan77 Sep 29 '16 at 17:13
  • The requirement is about 50-60 cameras per server. So... should I set the stripe size to the minimum with write cache enabled, or not? – A_buddy Sep 29 '16 at 17:17
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/46083/discussion-between-hurikhan77-and-alexander-b). – hurikhan77 Sep 29 '16 at 17:19
2

Why is the write cache disabled? If you have a battery on the RAID controller (no information given), you can activate Write-Back mode.

What about RAID 50? You lose some disk space but get a real performance increase.

I think RAID 5 on 10 disks is slowed down by the parity calculation, but I'm not an expert.

Leahkim
  • 175
  • 6
  • Thanks for the quick reaction. For now I have no battery. I think I should first test the speed with a 4x HDD RAID5 and compare. Anyway, this question remains open. – A_buddy Sep 28 '16 at 14:11
  • Adding spindles will help you most with controllers that can stripe diagonally across all spindles. – hurikhan77 Sep 29 '16 at 16:39
2

It's a typical RAID5 issue. You need a write cache to avoid the read-modify-write cycle, which degrades performance, sometimes very strongly: https://en.wikipedia.org/wiki/Read-modify-write I have several RAIDs of this type; without a cache they write very slowly.

0
  1. Yes, absolutely right. At the very least, I need to enable 'Write Cache' with the 'Write-Back' option to increase performance on random write commands.

The results with write cache:

-----------------------------------------------------------------------
CrystalDiskMark 3.0.1 x64 (C) 2007-2010 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :  1950.515 MB/s
          Sequential Write :  3545.783 MB/s
         Random Read 512KB :  2932.946 MB/s
        Random Write 512KB :  2980.421 MB/s
    Random Read 4KB (QD=1) :    79.481 MB/s [ 19404.4 IOPS]
   Random Write 4KB (QD=1) :    78.543 MB/s [ 19175.5 IOPS]
   Random Read 4KB (QD=32) :   551.817 MB/s [134721.0 IOPS]
  Random Write 4KB (QD=32) :   519.086 MB/s [126729.9 IOPS]

  Test : 50 MB [E: 0.0% (0.4/16740.0 GB)] (x2)
  Date : 2016/09/28 14:25:09
    OS : Windows NT 6.2 Server Standard Edition (full installation) [6.2 Build 9200] (x64)

Both results were taken on the same 10x HDD RAID5 using the Adaptec 8405. Sector size: 4k. Stripe size: 512k

  2. I think that sometimes a single-channel SAS controller is not enough for the I/O; use additional channels or expanders.
  3. RAID50 is an option to increase speed.

The results of RAID50 on the same hardware:

-----------------------------------------------------------------------
CrystalDiskMark 3.0.1 x64 (C) 2007-2010 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :  1799.496 MB/s
          Sequential Write :  3423.066 MB/s
         Random Read 512KB :  2450.107 MB/s
        Random Write 512KB :  2627.551 MB/s
    Random Read 4KB (QD=1) :    77.889 MB/s [ 19016.0 IOPS]
   Random Write 4KB (QD=1) :    76.530 MB/s [ 18684.0 IOPS]
   Random Read 4KB (QD=32) :   519.356 MB/s [126796.0 IOPS]
  Random Write 4KB (QD=32) :   522.408 MB/s [127541.0 IOPS]

  Test : 50 MB [E: 0.0% (0.6/14879.9 GB)] (x2)
  Date : 2016/09/29 6:41:13
    OS : Windows NT 6.2 Server Standard Edition (full installation) [6.2 Build 9200] (x64)

Anyway, I'm still looking for ways to make write access as fast as possible.

For deeper info about stripe size, see hurikhan77's answer above.


A_buddy
  • 35
  • 2
  • 11
  • Get a battery and enable write-cache. – Jeter-work Sep 28 '16 at 14:51
  • I expect that for a video recording system you probably use a UPS anyway, so there should be almost no downside to enabling controller-based caching, but please turn off disk-level caching, as it adds to the write-hole problem in this case. – hurikhan77 Sep 29 '16 at 17:07
  • Hmm, disabling write caching on the Adaptec 8405 slows down the speed test results. Maybe a battery + write cache will be good enough? I don't know the full specification of the future video system yet, but I think it will be IP cameras and some software. – A_buddy Sep 29 '16 at 17:13
  • If it's IP cameras... What's the limit of your network bandwidth? ;-) – hurikhan77 Sep 29 '16 at 17:57
  • Hah, indeed. 2x 1 Gb physical adapters. But I'm not sure that the stream rate and the disk I/O are equal. – A_buddy Sep 29 '16 at 18:05
  • It should be. IP cameras usually deliver raw MPEG streams which can be directly stored (including maybe some multiplexing headers). – hurikhan77 Sep 29 '16 at 18:43
0

Another idea: You can configure the kernel scheduler: https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt

Use the noop scheduler instead of the cfq one. The RAID controller has its own scheduler, so your requests are scheduled twice before being written to the device.

To set it: echo 'noop' > /sys/block/DEVICENAME/queue/scheduler

You can make it persistent by using the sysfsutils package on Debian.
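
As a sketch of what that could look like (the device name sdb is an assumption; use the block device that backs your array):

    # Hypothetical /etc/sysfs.conf entry; sysfsutils applies it at boot.
    echo "block/sdb/queue/scheduler = noop" >> /etc/sysfs.conf
    # Check which scheduler is currently active (shown in brackets):
    cat /sys/block/sdb/queue/scheduler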

Another idea: increase your RAM. Linux will automatically use it to cache files. I know it only moves the problem, but if your write bursts are short, it could be the best way.

Leahkim
  • 175
  • 6
  • 1
    This won't help here: as the RAID controller does no caching, you cannot benefit from using noop. Better to use deadline in that case. And maybe increase nr_requests in the block device queue settings, as SAS drives and RAID controllers usually offer deep IO queues. – hurikhan77 Sep 29 '16 at 16:22
0

Your slow write speed is due to how RAID5 manages incoming data when writing smaller-than-full-stripe elements. In that case, the controller is forced to execute repeated read-modify-write patterns in order to correctly write the incoming data, killing performance.

In your specific case, this problem is exacerbated by two factors:

  • no write cache, which could be used to coalesce multiple small writes into a single, bigger full-stripe write
  • a big stripe size (512 KB), which means that very few writes (if any) will be big enough to imply a full-stripe write (avoiding the bad read-modify-write scenario).

What can you do to mitigate the problem?

  • add a battery and enable write caching: sequential writes, and some random ones as well, will immediately be much faster (as the controller can coalesce multiple small, consecutive writes into bigger ones)
  • reduce your stripe size to 32/64 KB: with smaller stripes, a much higher percentage of writes can be full-stripe writes and avoid read-modify-write operations (see the sketch below).
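
To illustrate the numbers behind that suggestion (the 10-disk count is from the question; the rest is a rough sketch, not controller output): a full-stripe write on RAID5 spans (disks - 1) x stripe size.

    # Full-stripe write width for a 10-disk RAID5 at various stripe sizes.
    for stripe_kb in 512 64 32; do
        echo "stripe ${stripe_kb}K -> full stripe $(( (10 - 1) * stripe_kb )) KB"
    done
    # 512K stripes need ~4.5 MB of contiguous data to avoid read-modify-write;
    # 64K stripes only need 576 KB, which small sequential writers can reach.
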
shodanshok
  • 47,711
  • 7
  • 111
  • 180