
I am currently building a software RAID under Linux using the mdadm utility, and I've read a few articles that describe how to increase the stripe_cache_size value for the array and how to calculate an appropriate value for it.

I have increased mine to 16384, and the sync rate of a new RAID5 reported in /proc/mdstat has jumped from 71065K/sec to 143690K/sec (doubled!), which is good news. I also see the matching, expected increase in RAM usage. However, I can't find any documentation on what this setting does or how it works.
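
For reference, the change amounts to writing the new value into sysfs and watching the resync speed in /proc/mdstat. A minimal sketch, assuming the array is /dev/md0 (substitute your own device name):

    cat /sys/block/md0/md/stripe_cache_size           # current number of stripe entries (default is 256)
    echo 16384 > /sys/block/md0/md/stripe_cache_size  # raise the stripe cache; needs root
    grep -A 2 '^md0' /proc/mdstat                     # the "speed=" field shows the current resync rate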

It seems to be some sort of cache for the RAID that lives in RAM; that is all I can tell from its name and from the effect of changing it. Is there any official "Linux" documentation for this setting and a description of what it does?

dawud
jwbensley

2 Answers


From my understanding, stripe_cache_size is the number of stripe entries in the stripe cache. The size of a stripe entry varies from system to system, but it is mostly determined by the page size (4096 bytes by default on Linux); https://github.com/torvalds/linux/blob/master/drivers/md/raid5.c#L73 contains all of the stripe cache logic if you'd like to dig deeper. So in a 4-disk RAID5, a stripe_cache_size of 32768 will cost you 32768 × 4096 bytes × 4 = 512 MB of RAM. As far as I know it affects only RAID5.
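
To make that memory cost concrete, here is a rough calculation of the RAM used by the stripe cache. This is only a sketch; /dev/md0 and a 4-device array are assumptions:

    PAGE_SIZE=$(getconf PAGESIZE)                       # usually 4096 bytes on Linux
    STRIPES=$(cat /sys/block/md0/md/stripe_cache_size)  # e.g. 32768
    DEVICES=4                                           # member disks in the array (assumption)
    # one page per member device is held for every cached stripe entry
    echo $(( STRIPES * PAGE_SIZE * DEVICES / 1024 / 1024 )) MiB   # 32768 * 4096 * 4 = 512 MiB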

Here are two documentation references:

- https://github.com/torvalds/linux/blob/master/Documentation/md.txt#L603
- https://raid.wiki.kernel.org/index.php/Performance#Some_problem_solving_for_benchmarking

rudyattias

This helped explain the purpose to me: https://docs.kernel.org/driver-api/md/raid5-cache.html

write-back mode

write-back mode fixes the ‘write hole’ issue too, since all write data is cached on cache disk. But the main goal of ‘write-back’ cache is to speed up write. If a write crosses all RAID disks of a stripe, we call it full-stripe write. For non-full-stripe writes, MD must read old data before the new parity can be calculated. These synchronous reads hurt write throughput. Some writes which are sequential but not dispatched in the same time will suffer from this overhead too. Write-back cache will aggregate the data and flush the data to RAID disks only after the data becomes a full stripe write. This will completely avoid the overhead, so it’s very helpful for some workloads. A typical workload which does sequential write followed by fsync is an example.

In write-back mode, MD reports IO completion to upper layer (usually filesystems) right after the data hits cache disk. The data is flushed to raid disks later after specific conditions met. So cache disk failure will cause data loss.

In write-back mode, MD also caches data in memory. The memory cache includes the same data stored on cache disk, so a power loss doesn’t cause data loss. The memory cache size has performance impact for the array. It’s recommended the size is big. A user can configure the size by:

echo "2048" > /sys/block/md0/md/stripe_cache_size Too small cache disk will make the write aggregation less efficient in this mode depending on the workloads. It’s recommended to use a cache disk with at least several gigabytes size in write-back mode.

cvocvo