Why are my AWS instances suddenly becoming unresponsive and reporting high "stolen" CPU?

Question

The setup: I have a bunch of t2.small EC2 instances running thumbor, an image processing library, for simple on-the-fly image resizing. Originals are loaded from S3. In front of the instances I have an EC2 load balancer. I have New Relic server monitoring installed on the servers.

The problem: At random times, my servers suddenly start to experience extremely high average response times. When I look at the stats in New Relic, the only thing I see is that the servers' CPU spikes, consistently reporting "stolen" CPU.

My servers seem to have enough capacity, and there are NOT any extreme spikes in throughput when this happens.

I have noticed that if I stop and start the servers, the stolen CPU disappears and they run fine again until the next time. It could be hours or days between occurrences.

Why is this happening, and what can I do about it?

You probably ran out of CPU credits. – Michael Hampton Feb 10 '17 at 16:23

score 11 · Accepted Answer · answered Feb 10 '17 at 16:24

The t-series instances at Amazon use a quota system for CPU usage. When you reach your quota, you start seeing your stolen percentage rise. There isn't much you can do about the quota itself; it's structural to the offering. Your options are:

Use less CPU overall.
Use a larger t-series instance.
Use one of the m-series or c-series instances, which don't have a quota.

Tim · Answer 2 · 2017-03-14T22:20:17.093

As others have said, you're very likely running out of CPU credits. Basically, with T2 instances you get a fraction of a CPU (20% of a core in the case of a t2.small), with the ability to burst to one or two cores (depending on your instance type) up to the limit of your CPU credits. In most cases you also shouldn't use T instances behind a load balancer, because their variable performance can cause odd problems that are difficult to diagnose.
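To make the credit arithmetic concrete, here is a rough sketch of how a balance drains under load. It assumes the figures AWS has published for a t2.small (12 credits earned per hour, a maximum balance of 288, and one credit being one vCPU at 100% for one minute); those numbers and the function names are my assumptions for illustration, not something measured on the asker's servers.

```python
def minutes_until_throttled(utilization, balance, earn_per_hour=12.0):
    """Estimate minutes until the credit balance hits zero.

    utilization: fraction of one vCPU in use (0.0 to 1.0).
    balance: current CPU credit balance.
    earn_per_hour: assumed t2.small earn rate (12 credits/hour = 20% baseline).
    Returns None if the load is at or below baseline (credits never run out).
    """
    net_spend_per_minute = utilization - earn_per_hour / 60.0
    if net_spend_per_minute <= 0:
        return None
    return balance / net_spend_per_minute


def simulate_credit_balance(utilization_per_minute, start_balance,
                            earn_per_hour=12.0, max_balance=288.0):
    """Return the credit balance after each simulated minute.

    Each minute the instance earns earn_per_hour / 60 credits and spends
    `utilization` credits; the balance is clamped to [0, max_balance].
    """
    balance = start_balance
    history = []
    for utilization in utilization_per_minute:
        balance += earn_per_hour / 60.0
        balance -= utilization
        balance = max(0.0, min(balance, max_balance))
        history.append(balance)
    return history


# A full balance at 100% CPU lasts roughly six hours under these assumptions.
print(minutes_until_throttled(1.0, 288.0))
# At exactly the 20% baseline the balance never drains.
print(minutes_until_throttled(0.2, 288.0))
```

Under these assumptions, a server that averages only slightly above 20% CPU can take hours or days to drain its balance, which would fit throttling that appears at seemingly random times.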

If you're running out of credits you need to move to a larger T instance, or move to an instance that has consistent access to cores. C (compute optimised) or M (general purpose) would be more appropriate.

You can monitor your CPU credits with CloudWatch. This will help you decide whether to go with a larger T instance or a C/M instance.
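As a sketch of what that monitoring could look like from a script, the snippet below reads the `CPUCreditBalance` metric through a CloudWatch client's `get_metric_statistics` call. The helper names, the threshold of 10 credits, and the instance ID in the usage note are hypothetical; the client is passed in so the function makes no assumption about how your credentials are configured.

```python
from datetime import datetime, timedelta, timezone


def fetch_credit_balance(cloudwatch, instance_id, hours=24):
    """Return (timestamp, average CPUCreditBalance) pairs, oldest first.

    cloudwatch: a CloudWatch client, e.g. boto3.client("cloudwatch").
    instance_id: the EC2 instance to inspect.
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # five-minute buckets
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to arrive in time order, so sort them.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Average"]) for p in points]


def low_credit_periods(points, threshold=10.0):
    """Return the timestamps at which the balance was below the threshold."""
    return [timestamp for timestamp, balance in points if balance < threshold]
```

With boto3 installed and credentials configured, something like `low_credit_periods(fetch_credit_balance(boto3.client("cloudwatch"), "i-0123456789abcdef0"))` would list the times the balance was nearly exhausted. If those times line up with the response-time spikes, credits are the likely cause.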
