The setup
I have a number of t2.small EC2 instances hosting thumbor, an image processing library, for simple on-the-fly image resizing. Originals are loaded from S3. In front of the instances I have an EC2 load balancer. I have New Relic server monitoring installed on the servers.
The problem
At random times, my servers suddenly start to experience extremely high average response times. When I look at the stats in New Relic, the only thing I see is that the servers' CPU usage spikes, and it is consistently reported as "stolen" CPU.
My servers seem to have enough capacity, and there are NOT any extreme spikes in throughput when this happens.
I have noticed that if I stop and start the servers, the stolen CPU disappears and they run fine again until the next time. It can be hours or days between occurrences.
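The "stolen" figure described above can also be cross-checked on an instance itself, independently of New Relic. The following is only a minimal sketch, assuming a Linux guest whose /proc/stat exposes the steal counter as the eighth field of the aggregate "cpu" line; the function names and the 5-second interval are illustrative choices, not anything from my actual setup.

```python
import time


def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line as ints.

    Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    The trailing guest fields are skipped because the kernel already
    counts them inside user and nice.
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    if len(parts) < 9:
        raise ValueError("no steal field found (assumption: kernel reports it)")
    return [int(value) for value in parts[1:9]]


def steal_percent(before, after):
    """Percentage of CPU time counted as steal between two samples."""
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_counters(path="/proc/stat"):
    """Read the aggregate CPU counters from /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def sample_steal(interval=5.0):
    """Take two samples `interval` seconds apart and return steal in percent."""
    first = read_cpu_counters()
    time.sleep(interval)
    second = read_cpu_counters()
    return steal_percent(first, second)
```

Calling `sample_steal()` on an affected server during a slow period and again after a stop/start would show whether the operating system reports the same steal pattern as New Relic.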
Why is this happening, and what can I do about it?