
For over two weeks now I've observed that my RDS instance (PostgreSQL 10.6 on a db.t3.small) has daily, roughly 2-hour CPU peaks during business hours, together with increased read and write latency, which lead to poor responsiveness or timeouts in my application.

I've investigated (see below), and at this point I'm quite convinced these peaks impacting my users aren't caused by my own usage. I tend to think they're caused either by rogue RDS admin tasks or by some PostgreSQL issue.

Has anyone encountered and resolved a similar issue with PostgreSQL? Can someone help me investigate the RDS admin task side, or point me to other avenues to get to the bottom of this?

What I observe:

  • In the RDS dashboard, clear CPU peaks lasting roughly 2 hours, with usage around 20%, whereas under normal usage the CPU stays well below 5%
  • Read and write latency increasing around these CPU peaks
  • Queries from my production application's DB get slow and even time out, making the app unusable for my users

What I've investigated:

  • DB connections aren't high during the peaks, 0 to 10 max.
  • My DB is tiny; pg_database_size tells me it's 18 MB! My largest table has 1169 rows at the moment, and none has more than 10 columns.
  • Free storage space is fine, still above 19,000 MB.
  • I know my users aren't very busy right now, as thankfully this is something of a holiday period for their use of my application. And again, on days and in timeframes when I know usage of my app was high, CPU never went over 5%.
  • I do not have scheduled tasks or processes on this DB.
  • Logging all statements, as well as those taking longer than 200 ms, confirms this: there aren't many statements happening apart from PgAdmin stats queries, which take less than 200 ms and had no impact on CPU usage when I stopped them.
  • Backups aren't to blame; they happen during the night and take around 3 minutes.
  • It isn't linked to bad queries or hanging transactions as far as I could see. I checked pg_stat_activity during a peak and looked at the duration of the sessions that were "idle in transaction" or "active". There were 10-11 entries in there at most, and nothing suspicious in the 4-5 that were mine. The rest were "rdsadmin" activities whose details I don't have the privileges to see. The only activity I had a slight doubt about was PgAdmin collecting stats, but I killed it with pg_cancel_backend and shut down my PgAdmin server; the activity disappeared and the peak continued for more than 30 minutes. (See the sketch after this list for the kind of queries involved.)
  • Performance Insights doesn't seem to point to any suspicious activity during these peaks.
  • In the basic PostgreSQL logs I see that checkpoints do become much longer (10 to 100 times longer), but well into the peaks rather than at their start.
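
For reference, here is a minimal sketch of the kind of checks described above (database size, sessions that are active or idle in transaction, cancelling a backend). It assumes psycopg2 and placeholder connection details, and is an illustration rather than my exact session; statement logging itself is configured on RDS through the parameter group (log_statement / log_min_duration_statement).

# Minimal sketch of the checks above; connection details are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-instance.xxxxxxxx.eu-west-3.rds.amazonaws.com",  # placeholder endpoint
    dbname="mydb",
    user="myuser",
    password="...",
)
with conn, conn.cursor() as cur:
    # Database size (how I confirmed the DB is ~18 MB)
    cur.execute("SELECT pg_size_pretty(pg_database_size(current_database()))")
    print("DB size:", cur.fetchone()[0])

    # Sessions that are active or idle in transaction, oldest transactions first
    cur.execute("""
        SELECT pid, usename, state, now() - xact_start AS xact_age, query
        FROM pg_stat_activity
        WHERE state IN ('active', 'idle in transaction')
        ORDER BY xact_age DESC NULLS LAST
    """)
    for pid, user, state, xact_age, query in cur.fetchall():
        print(pid, user, state, xact_age, (query or "")[:80])

    # Cancelling a suspicious backend (what I did for the PgAdmin stats query):
    # cur.execute("SELECT pg_cancel_backend(%s)", (suspect_pid,))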

Here are basic logs around a peak start (before activating statement logs):

2019-12-09 15:04:05 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:04:05 UTC::@:[4221]:LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.202 s, sync=0.001 s, total=0.213 s; sync files=2, longest=0.001 s, average=0.000 s; distance=16369 kB, estimate=16395 kB
2019-12-09 15:09:05 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:09:05 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.001 s, total=0.112 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16394 kB
2019-12-09 15:14:05 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:14:05 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.113 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16393 kB
2019-12-09 15:19:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:19:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.001 s, total=0.113 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16392 kB

[The CPU peak starts here that day, at 16:20 UTC+1]

2019-12-09 15:24:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:24:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.114 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16391 kB
2019-12-09 15:29:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:29:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.113 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16390 kB
2019-12-09 15:34:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:34:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.103 s, sync=0.002 s, total=0.118 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16390 kB
2019-12-09 15:39:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:39:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.104 s, sync=0.003 s, total=0.127 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16389 kB
2019-12-09 15:44:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:44:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.219 s, sync=0.010 s, total=0.303 s; sync files=2, longest=0.010 s, average=0.005 s; distance=16392 kB, estimate=16392 kB
2019-12-09 15:49:07 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:49:09 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.318 s, sync=0.516 s, total=2.426 s; sync files=1, longest=0.516 s, average=0.516 s; distance=16375 kB, estimate=16390 kB
2019-12-09 15:54:07 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:54:09 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.367 s, sync=1.230 s, total=2.043 s; sync files=1, longest=1.230 s, average=1.230 s; distance=16384 kB, estimate=16389 kB
2019-12-09 15:59:07 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:59:08 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.139 s, sync=0.195 s, total=1.124 s; sync files=1, longest=0.195 s, average=0.195 s; distance=16383 kB, estimate=16389 kB

[Screenshots: CPU around one peak, CPU over a week, read latency around a peak, write latency around a peak, Performance Insights around the Dec 10 peak, Performance Insights around the Dec 9 peak]

user2870968

1 Answer


It could be that you are running out of burst credits on your disk because of a background process of PostgreSQL. All disks on RDS are of the gp2 type, if I remember correctly, meaning you have a certain base IOPS plus credits you can spend to go above that for a short period of time. You should be able to see this effect in the queue depth metric on the monitoring page; if this is happening, you should see a peak in that number. The easiest solution is to just increase the disk size.
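
If you want to pull those numbers outside the console, a rough sketch along these lines should work (assuming boto3 with credentials set up; the instance identifier and region are placeholders). If I remember correctly, DiskQueueDepth is the queue depth metric and BurstBalance is the percentage of gp2 burst credits remaining:

# Rough sketch: pull queue depth and gp2 burst balance for an RDS instance from CloudWatch.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-3")  # placeholder region

def rds_metric(metric_name, hours=24):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],  # placeholder
        StartTime=datetime.utcnow() - timedelta(hours=hours),
        EndTime=datetime.utcnow(),
        Period=300,                 # 5-minute datapoints
        Statistics=["Average"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

for point in rds_metric("DiskQueueDepth"):
    print(point["Timestamp"], round(point["Average"], 3))

for point in rds_metric("BurstBalance"):    # % of burst credits remaining
    print(point["Timestamp"], point["Average"])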

  • Thanks for the insight Stijn. I've checked Queue Depth around my past daily CPU peaks. I see Queue Depth peaks around the CPU peaks, but not consistently before they start. I also see Queue Depth peaks that are higher than the ones around CPU peaks and that didn't cause any issue AFAIK. The highest such peak in the past 2 weeks has a value of 0.22; no idea whether that's much. So I would think this is not the cause. If you're interested, the investigation has progressed, sort of, [here](https://forums.aws.amazon.com/thread.jspa?threadID=314052) – user2870968 Dec 16 '19 at 08:53
  • A peak of 0.22 is nothing, so nothing to worry about there. Another thing you can scratch off the list, I guess :) – Stijn De Haes Dec 16 '19 at 09:29