Amazon Aurora read-only slaves slowly increase to 100% CPU and die

Question

We have a master database instance hosted on Amazon Aurora (MySQL) with many read-only slaves replicating from it. The master and the 4-12 autoscaling slaves are currently db.r4.4xlarge instances running engine version 5.7.12.

Each slave comes online and performs well for a few days, but over that time its CPU usage slowly increases until it has to be killed manually. Once one is killed, another is automatically spun up and the cycle continues.

Here's the performance graph of the slaves:

As you can see, our warehouse closes at 11pm and CPU utilisation falls until the next day, when it spikes and climbs above the previous day's peak. The peak rises day on day until it reaches 100% and the instance has to be killed.

Have any of you ever seen this pattern before, and could you give us a hint as to where the problem might lie?

AWS support have been a bit useless to be honest and want to move us to another increased cost support tier for any further help. So before that happens we are entertaining other opinions. — Gary Willoughby, May 08 '19 at 14:32
High CPU usage almost always implies poorly performing `SELECTs`. Find a slow one, and present it to us for critique. Include `SHOW CREATE TABLE` and `EXPLAIN SELECT ...` — Rick James, May 10 '19 at 00:49
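A minimal sketch of collecting what this comment asks for, assuming a Python DB-API cursor connected to one of the read-only instances (the driver, for example PyMySQL, is an assumption; the question does not say what client is used). The function name `collect_diagnostics` is hypothetical.

```python
import re

# Conservative check so table names can be wrapped in backticks safely.
_IDENT = re.compile(r"^[A-Za-z0-9_$]+$")


def collect_diagnostics(cursor, select_sql, tables):
    """Return EXPLAIN and SHOW CREATE TABLE output as one text report.

    `cursor` is any DB-API cursor. `select_sql` is the suspect SELECT and
    is treated as trusted input, because it is prefixed with EXPLAIN verbatim.
    """
    select_sql = select_sql.strip().rstrip(";")
    if not select_sql.upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are supported")

    lines = ["-- EXPLAIN " + select_sql]
    cursor.execute("EXPLAIN " + select_sql)
    for row in cursor.fetchall():
        lines.append(" | ".join(str(col) for col in row))

    for table in tables:
        if not _IDENT.match(table):
            raise ValueError("unexpected table name: %r" % table)
        cursor.execute("SHOW CREATE TABLE `%s`" % table)
        row = cursor.fetchone()
        # MySQL returns (table_name, create_statement) for this statement.
        lines.append("-- SHOW CREATE TABLE " + table)
        lines.append(str(row[1]))

    return "\n".join(lines)
```

The returned text can be pasted into the question as-is, so that reviewers see the query plan next to the table definitions it refers to.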
Can you gather some details? 1. Is there any network latency? 2. What do the IOPS stats say? 3. Enable the slow query log, then inspect and optimize the queries it captures. 4. Is there any memory pressure? — asktyagi, May 11 '19 at 14:37

score 0 · Answer 1 · answered May 13 '19 at 11:50

I recommend enabling Amazon RDS Performance Insights to get hints about what is consuming CPU.

With symptoms like these, and if the SQL traffic gives no clue, it would help to use Linux "perf" to identify which functions are consuming CPU (provided the binaries still have their symbols, i.e. are not stripped) and to confirm that the load does not come from internal replication management. However, "perf" cannot be used on RDS instances.
