0

I have an ESXi 5.5 server and I'm experiencing high write latency on a local datastore.

This datastore is on a virtual disk provided by a RAID card (two SATA disks on RAID 1).

When I copy large files, it takes ages to complete the transfer and write latency averages 84ms! This is way too much.

I know RAID 1 doesn't improve write rates, btw.

So I'm trying to find where the bottleneck is. Could it be the RAID card? (PCI-e 8x, 100% hardware). Could it be a fragmentation issue? (Not very probable on VMFS).

If you have already experienced high latency on a local datastore, I'd like to have your feedback. Thanks :)

mimipc
  • 1
    What controller are you using? What is the server hardware? What types of disks are in use? Do you have a battery-backed or flash-backed RAID controller in place? If so, what is the cache size and ratio? – ewwhite Dec 02 '13 at 14:31

2 Answers

3

Simply put, your problem is:

(two SATA disks on RAID 1).

Turn it however you like, but two probably very slow discs are still two very slow discs, and nothing except heavy caching will work around that. You have a small IOPS budget right there, and the only thing that can fix that is having a larger one.

For example, RAID 10 with 8-10 discs will give you a lot more IOPS. Using an enterprise RAID controller (like the Adaptec 71605Q) with multiple SSDs as a transparent cache will fix the write issue. I regularly copy files at 500 MB/s to a RAID 6 thanks to that.

But your problem is that two not very fast (i.e. max. 7200 RPM) SATA discs are just that, and you don't seem to have the other hardware to mitigate this.
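
To put rough numbers on that IOPS budget, here is a back-of-envelope sketch in Python. It is not a measurement of this server: the 8.5 ms average seek time and the queue depths are assumptions chosen for illustration, and the model only covers random I/O with no controller cache absorbing bursts. The function names are made up for this example.

```python
# Back-of-envelope model of a spinning disc under random I/O.
# All numbers are illustrative assumptions, not measurements.


def estimate_iops(rpm: float, avg_seek_ms: float) -> float:
    """Random IOPS of one disc: one I/O costs a seek plus half a rotation."""
    half_rotation_ms = 0.5 * 60_000.0 / rpm
    return 1000.0 / (avg_seek_ms + half_rotation_ms)


def queued_latency_ms(outstanding_ios: int, iops: float) -> float:
    """Little's law: average latency = outstanding requests / throughput."""
    return 1000.0 * outstanding_ios / iops


# Assumed 8.5 ms average seek for a 7200 RPM SATA disc.
# In RAID 1 every write goes to both discs, so the mirror has about
# the write IOPS of a single disc.
mirror_write_iops = estimate_iops(rpm=7200, avg_seek_ms=8.5)

if __name__ == "__main__":
    print(f"write IOPS budget: ~{mirror_write_iops:.0f}")
    for depth in (1, 4, 8, 32):  # hypothetical queue depths
        latency = queued_latency_ms(depth, mirror_write_iops)
        print(f"{depth:>2} outstanding writes -> ~{latency:.0f} ms average latency")
```

Under these assumptions, fewer than ten outstanding random writes are enough to push the average latency past 84 ms, which is why a long queue matters more than the raw speed of the discs.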

TomTom
  • I understand this quite well, but 84ms (avg) seems way, way too much for 7200rpm drives. I can't afford the solutions you're talking about, as it is for a small test environment, but I don't understand why it is SO slow. – mimipc Dec 02 '13 at 12:02
  • 3
    Hardly. If you overload them, that happens. VMware likely hits them with a VERY long queue. I regularly have 200+ms delay on an 8 disc RAID 10 for my database servers. It is not SLOW - it is overloaded, because copying large files overloads it. – TomTom Dec 02 '13 at 12:20
1

If your guest is Linux, you can tell it to do much more aggressive write caching, which helps a lot in dealing with such write latencies. The default 5 or 30 second write cache flushing intervals come from the dreams of the filesystem/vm developers, in which they never have to debug mysterious problems that originate in the hardware while their own code is fine.

The following sysctl settings give you much better balanced write operations:

vm.dirty_background_ratio = 20
vm.dirty_expire_centisecs = 360000
vm.dirty_writeback_centisecs = 360000

(Other OSes, including ESXi, have the same problem, but there you can't change the vm writeback parameters so easily.)
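
To make the units of the three settings above concrete, here is a small Python sketch that converts them into seconds and bytes. The 4 GiB figure for available memory is an assumption for illustration only (nothing in the question says how much RAM the guest has), and the helper names are made up for this example.

```python
# Translate the sysctl values above into human-readable units.
# AVAILABLE_BYTES is an assumed example figure, not taken from the question.

SETTINGS = {
    "vm.dirty_background_ratio": 20,         # percent of available memory
    "vm.dirty_expire_centisecs": 360000,     # age at which dirty data must be written
    "vm.dirty_writeback_centisecs": 360000,  # wake-up interval of the flusher threads
}

AVAILABLE_BYTES = 4 * 1024**3  # assumption: 4 GiB of free + reclaimable memory


def centisecs_to_seconds(centisecs: int) -> float:
    """The *_centisecs sysctls are expressed in hundredths of a second."""
    return centisecs / 100.0


def background_threshold_bytes(ratio_percent: int, available_bytes: int) -> int:
    """Amount of dirty data at which background writeback starts."""
    return available_bytes * ratio_percent // 100


if __name__ == "__main__":
    expire_s = centisecs_to_seconds(SETTINGS["vm.dirty_expire_centisecs"])
    wakeup_s = centisecs_to_seconds(SETTINGS["vm.dirty_writeback_centisecs"])
    threshold = background_threshold_bytes(
        SETTINGS["vm.dirty_background_ratio"], AVAILABLE_BYTES
    )
    print(f"dirty data expires after {expire_s:.0f} s ({expire_s / 60:.0f} min)")
    print(f"flusher threads wake every {wakeup_s:.0f} s")
    print(f"background writeback starts at ~{threshold / 1024**2:.0f} MiB dirty")
```

So 360000 centiseconds is 3600 seconds, i.e. one hour: with these values time-based flushing is effectively stretched to about an hour, and writeback is mostly triggered by the 20% threshold instead. The flip side is that more unwritten data sits in RAM in the meantime.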

peterh
  • Nice explanation of this: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ – col.panic Jan 15 '18 at 07:08