Can someone help interpret high latency / low throughput on an ESXi host (direct-attached storage)?

Question

As a disclaimer: I am not a storage guy, so ELI5 ;) I am looking at an ESXi host with direct-attached storage (SAS SSDs and HDDs in RAID1, on different datastores). System X, shown in the graphs, is on the HDD RAID; the other one (System Z) is on the SSDs.

Latency graph from ESXi - System X

Latency graph from ESXi - System Z

Both systems use databases (alongside other stuff). System X (shown in the graph) queries data from System Z (Postgres), imports part of it and displays it. As you can see, we have some pretty high latency here. I also see only low throughput for System X, and System X has frequent database locks.

Both systems have CPU and RAM galore; all I can see are disk performance issues.

Without any additional info: the latency seems crazy, am I right? My first advice was to separate the systems onto dedicated datastores (and thus dedicated underlying disks), as they both tend to have very high IOPS requirements.

Unfortunately I do not have that many details, but I am looking for some good questions to ask in the end. I plan to look into the filesystem and mount options and the disk provisioning (thin / thick), maybe run some tests with dd / hdparm / fio, and check whether write-back caching is enabled on the RAID controller. What else should I check?
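
For the in-guest tests mentioned above, a rough sketch of a synced-write latency probe in Python could look like the following. It is not a replacement for fio, only a quick sanity check: it writes small blocks, forces each one to disk with fsync, and reports median, p99 and maximum latency. The function names, block size and sample count are my own assumptions, and the target directory has to be pointed at the disk under test.

```python
import math
import os
import statistics
import tempfile
import time


def fsync_write_latencies(directory, count=100, block_size=4096):
    """Write `count` blocks, fsync after each one, return per-write latencies in ms."""
    payload = os.urandom(block_size)
    latencies_ms = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # push the write through the OS cache to the (virtual) disk
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies_ms


def summarize(latencies_ms):
    """Return count, median, nearest-rank p99 and max of the samples."""
    ordered = sorted(latencies_ms)
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Assumption: the temp directory lives on the disk you want to test.
    # Replace it with a directory on the datastore-backed volume otherwise.
    print(summarize(fsync_write_latencies(tempfile.gettempdir())))
```

Running this once in System X and once in System Z should at least show whether synced writes on the HDD-backed datastore are an order of magnitude slower than on the SSD-backed one. The numbers include guest filesystem overhead, so they will not match the ESXi latency graphs exactly.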

Thanks, MMF

I would recommend looking at the RAID configuration to check how the read and write cache is configured. If your ESXi was installed using the vendor-specific image, you should have the command-line tools to get that information (I've done it on HPE servers a few times). I would also check whether you can replicate the latency with command-line file transfers between datastores. Hope it helps. — Simon Cateau, Aug 28 '20 at 15:46
Could you suggest how I can see latency with a command-line file transfer? Or do I just copy between the datastores and look at the graphs? — MMF, Aug 31 '20 at 07:55
Yes, that was the idea: a simple copy between datastores via the command line will let you see whether the latency is comparable. If it is, then you're most likely looking at a datastore-only issue. If it's not, there might be something higher up in the stack (OS version, drivers, etc.) that is causing the issue. Were you able to get the RAID caching config? What about data transfer latency on the same datastore? — Simon Cateau, Sep 14 '20 at 14:02
Unfortunately I did not get any further feedback, sorry for abandoning this post for so long! I appreciate your input, thanks a lot! — MMF, Oct 12 '20 at 15:09
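
The copy test suggested in the comments can be made a bit more measurable by timing it. Below is a minimal sketch in Python, assuming a Python interpreter is available wherever both paths are reachable; the function name and chunk size are my own choices, and the paths in the usage comment are placeholders, not real datastore names.

```python
import os
import time


def timed_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks, fsync the result, return (bytes, seconds, MiB/s)."""
    copied = 0
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
        fout.flush()
        os.fsync(fout.fileno())  # include the time to actually reach the disk
    elapsed = time.perf_counter() - start
    mib_per_s = (copied / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")
    return copied, elapsed, mib_per_s


# Hypothetical usage, with placeholder datastore names:
# timed_copy("/vmfs/volumes/<hdd-datastore>/test.bin",
#            "/vmfs/volumes/<ssd-datastore>/test.bin")
```

Copying the same file HDD to SSD, SSD to HDD, and within one datastore gives three throughput figures to set against the latency graphs. The test file should be larger than any controller or OS cache, otherwise the numbers mostly measure the cache.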
