We have a scheduled cleanup job that removes records whose timestamp lies in the past. We select them in batches of 10,000 and delete them by primary key (a composite key of an integer and a timestamp). This job can delete up to 200,000 records within a few seconds.
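For reference, each batch is equivalent to something like the following sketch (the table and column names `events`, `id`, `ts` are illustrative, not our real schema):

```sql
-- One batch of the cleanup job: pick up to 10,000 expired rows
-- via the timestamp index, then delete them by composite primary key.
WITH batch AS (
    SELECT id, ts
    FROM events
    WHERE ts < now()
    ORDER BY ts
    LIMIT 10000
)
DELETE FROM events e
USING batch b
WHERE e.id = b.id
  AND e.ts = b.ts;
```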
We can consistently reproduce that, immediately after this job runs, the CPU load on all the readers goes to 100% (and therefore blocks the application). We can also consistently see that running VACUUM brings the load back to normal, immediately.
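By "running VACUUM" we mean a plain (non-FULL) manual vacuum of the one table, e.g. (table name illustrative as above):

```sql
-- Plain manual vacuum; no exclusive lock, no table rewrite.
VACUUM (VERBOSE) events;
```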
(Chart: CPU load during the batch delete and after VACUUM.)
Is this a known issue? Anything we can do to avoid the 100% CPU spike?
Setup
- Aurora (AWS) PostgreSQL 12.6 cluster
- 1 writer, 2 readers (db.r5.2xlarge)
- Database
  - DB size: 50 GB
  - Index size: 30 GB
  - One table (see the schema sketch after this list):
    - Primary key is a composite key (integer, timestamp)
    - Additional index on the timestamp column (used to select records in the past)
- Load
  - CPU on writer: 15%-25%
  - CPU on readers: 5%
  - Commit throughput:
    - 300 commits/second on the writer
    - 300 commits/second on each reader
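A minimal sketch of the table layout described above (all names are placeholders):

```sql
-- Composite primary key (integer, timestamp) as described above.
CREATE TABLE events (
    id INTEGER     NOT NULL,
    ts TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (id, ts)
);

-- Additional index used by the cleanup job to find records in the past.
CREATE INDEX events_ts_idx ON events (ts);
```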
Autovacuum settings:
| Name | Value |
|---|---|
| autovacuum | on |
| log_autovacuum_min_duration | 10s |
| autovacuum_max_workers | 3 |
| autovacuum_naptime | 5s |
| autovacuum_vacuum_threshold | 50 |
| autovacuum_analyze_threshold | 50 |
| autovacuum_vacuum_scale_factor | 0.1 |
| autovacuum_analyze_scale_factor | 0.05 |
| autovacuum_freeze_max_age | 200000000 |
| autovacuum_vacuum_cost_delay | 5ms |
| autovacuum_vacuum_cost_limit | 1200 |
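These are the cluster-wide values; we could also override them per table if that would help, e.g. (the values here are purely illustrative, not something we have tried or a recommendation):

```sql
-- Hypothetical per-table override of the cluster-wide autovacuum settings,
-- making autovacuum trigger sooner and run more aggressively on this table.
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay   = 2,     -- milliseconds
    autovacuum_vacuum_cost_limit   = 2000
);
```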