I have a Postgres data-warehouse-type machine that does continuous 24/7 work. It goes through periods of very problematic performance whose cause I've had the hardest time pinning down. Here are some bullet points:
- t3.2xlarge
- gp3, 1.6 TB, 8,000 IOPS + 200 MB/s throughput (though I've tried various IOPS/throughput levels, and io1 as well)
- Ubuntu 18.04
Without going into database specifics, which could be a whole other conversation, assume that everything else is configured about as well as one could reasonably expect.
The main issues are:
The "Average Queue Length" is always in the 3-4 range. My impression is that this is high, though I've had a hard time finding a concrete answer for that - what I've seen is mostly relative advice (lower is better - obviously).
Though not always, many times and for extended periods the total IOPS throughput is flatlined at exactly 4000 - despite my provisioning always being well above that. This is very specific and suspicious. Meanwhile, my combined throughput is << my provisioned throughput.
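For reference, this is roughly how I pull the raw numbers behind those two observations from the per-volume CloudWatch metrics (the volume ID and region below are placeholders):

```python
from datetime import datetime, timedelta, timezone
import boto3

VOLUME_ID = "vol-0123456789abcdef0"                       # placeholder: the gp3 data volume
cw = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

def volume_metric(name, stat):
    """Fetch one AWS/EBS metric for the volume in 5-minute buckets."""
    resp = cw.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=name,
        Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=[stat],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

reads = volume_metric("VolumeReadOps", "Sum")
writes = volume_metric("VolumeWriteOps", "Sum")
queue = volume_metric("VolumeQueueLength", "Average")

# IOPS = completed operations in the bucket / seconds per bucket.
for r, w, q in zip(reads, writes, queue):
    iops = (r["Sum"] + w["Sum"]) / 300.0
    print(f"{r['Timestamp']:%H:%M}  iops={iops:8.0f}  queue_len={q['Average']:.2f}")
```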
Combining 1 & 2 together, it is very clear that something is constraining the disk. I am unsure where to start (over again) to clearly answer that, despite trying many things already. Could it be a ubuntu setting? Could it be a network card thing? How do I know if AWS isn't just cheating me?