We have a master database instance hosted on AWS Aurora (mysql) and have many read-only slaves being replicated from it. The master and the 4-12 autoscaling slaves are currently of db.r4.4xlarge size and engine version: 5.7.12.
Each slave comes online and performs for a few days but over that time its CPU usage slowly increases until each one has to be individually killed. Once killed another is automatically spun up and it continues.
Here's the performance graph of the slaves:
As you can see at 11pm our warehouse closes and CPU utilisation falls until the next day when it spikes and climbs above the previous day's. This increases day on day until it reaches 100% and has to be killed.
Have any of you guys every seen this pattern before and could give us a hint on where the problem might lie?