
I tried Auto Scaling groups and, alternatively, just a bunch of EC2 instances behind a load balancer. Both configurations work fine at first glance.

But when an EC2 instance is part of an Auto Scaling group, it sometimes goes down. Actually, it happens very often, almost once a day, and the instances go down in a "hard reset" way. The EC2 monitoring graphs show CPU usage going up to 100%, then the instance becomes unresponsive, and then it is terminated by the Auto Scaling group.

And it has nothing to do with my processes on these instances.

When the instance is not part of an Auto Scaling group, it can run for years without these CPU usage spikes.

The "hard resets" on Auto Scaling group instances are breaking my cron jobs. As much as I like Auto Scaling groups, I cannot use them.

Is there a standard way to deal with the "hard resets"?

PS.

In my case the cron jobs run PHP scripts on Ubuntu. I have arranged for only one instance to run the job.

timur
Yevgeniy Afanasyev
  • What do you mean by "hard resets" in ASG? – Marcin Feb 23 '21 at 07:08
  • By "hard resets" I meant "termination without giving the software time to finalize its processes". I thought a "hard reset" means that someone just pulled the power cable out of the socket without waiting for the computer to shut down. Please help me find a better word for it. Thank you. – Yevgeniy Afanasyev Feb 23 '21 at 22:52
  • Are you using spot instances by any chance? What sort of scaling conditions have you got? – timur Feb 26 '21 at 02:34
  • No, I don't use spot instances, and all the scaling conditions are defaults: Scaling policies (0), Scheduled actions (0). How often do your instances in Auto Scaling groups go down? I have even set everything to 2, but it did not help (Desired capacity: 2, Minimum capacity: 2, Maximum capacity: 2). – Yevgeniy Afanasyev Feb 28 '21 at 23:40
  • What kind of instances are you running? Burstable? – timur Mar 01 '21 at 07:01
  • @YevgeniyAfanasyev are you running the CloudWatch agent on those machines? – Amine Bouzid Mar 01 '21 at 10:29
  • @timur, yes they are t2.micro, I guess it means they are Burstable. Why? – Yevgeniy Afanasyev Mar 02 '21 at 04:57
  • My initial guess was that your instances are getting throttled so they cannot respond to a liveness check, but having read a bit more documentation I think that's not the case. – timur Mar 02 '21 at 05:02
  • @AmineBouzid, yes, I have cloudwatch – Yevgeniy Afanasyev Mar 02 '21 at 05:09
  • @YevgeniyAfanasyev you can write a script that runs on system shutdown and transfers the system logs, so you can investigate which process is using 100% of the CPU – Amine Bouzid Mar 02 '21 at 09:02

2 Answers


It sounds like you have a health check that is failing while your cron job is running, and as a result the instance is being taken out of service.

If you look at the ASG, there should be a reason listed for why the instance was taken out. This will usually be a health check failure, but there could be other reasons as well.
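As a rough sketch of how to pull that reason out programmatically, the snippet below filters the ASG's scaling activities down to terminations and prints the recorded cause. It assumes boto3 is installed and AWS credentials are configured; the helper names and the group name `my-asg` are placeholders for illustration, not something from the question.

```python
from typing import Dict, Iterable, List


def summarize_terminations(activities: Iterable[Dict]) -> List[str]:
    """Return one 'start time: cause' line per activity that terminated an instance."""
    lines = []
    for activity in activities:
        # Termination activities are described as "Terminating EC2 instance: i-...".
        if not activity.get("Description", "").startswith("Terminating"):
            continue
        cause = activity.get("Cause", "no cause recorded")
        lines.append(f"{activity.get('StartTime')}: {cause}")
    return lines


def fetch_activities(group_name: str) -> List[Dict]:
    """Fetch recent scaling activities (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("autoscaling")
    response = client.describe_scaling_activities(AutoScalingGroupName=group_name)
    return response["Activities"]


# Example usage ("my-asg" is a hypothetical group name):
#   for line in summarize_terminations(fetch_activities("my-asg")):
#       print(line)
```

The same information is shown on the Activity tab of the group in the console; a script is only worth it if you want to collect the causes over several days.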

There are a couple things you can do to fix this.

First, determine why your cron is taking 100% of CPU, and how long it generally takes.

Review your health check settings. Are you using HTTP or TCP? What is the interval, and how many checks have to fail before it is taken out of service?

Between those two items, you should be able to adjust the health checks so that the instance isn't taken out of service while the cron job is running. It is also possible that the instance is actually failing; typically this would be because it runs out of memory. If that is the case, you may want to consider moving to a larger instance type and/or enabling swap.
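To make the comparison concrete, here is a back-of-the-envelope sketch. It treats the time before an instance is marked unhealthy as roughly the check interval multiplied by the number of consecutive failures required, ignoring per-check timeouts. The function names and the numbers are illustrative assumptions, not your actual settings.

```python
def seconds_until_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time an instance can fail checks before being marked unhealthy."""
    return interval_s * unhealthy_threshold


def cron_outlasts_health_check(cron_duration_s: int, interval_s: int,
                               unhealthy_threshold: int) -> bool:
    """True if a cron run that blocks health checks is long enough to trip them."""
    return cron_duration_s >= seconds_until_unhealthy(interval_s, unhealthy_threshold)


# Assumed example: checks every 30 s, 2 consecutive failures required.
print(seconds_until_unhealthy(30, 2))         # 60
print(cron_outlasts_health_check(90, 30, 2))  # True: a 90 s CPU-bound run would trip it
print(cron_outlasts_health_check(45, 30, 2))  # False
```

If the cron run is longer than that window, either lengthen the interval or raise the threshold, or make the job less CPU-hungry so the health endpoint can still answer.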

chris

I once had a similar issue; in that case the cause was the system's automatic update. The system (Windows Server) was downloading a big update, which took 100% of the CPU for hours. My suggestion is to monitor which service is running at that moment (even if the OS is Linux), and also to check for any scheduled tasks (as it looks like this happens periodically). Other than that, try to keep the task list open during the event and see what is going on.
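Since nobody can watch the task list all day, a small sampler can do it instead. The sketch below is one possible approach on Linux, not a standard tool: it reads `/proc/<pid>/stat` twice and ranks processes by the CPU ticks they consumed in between. The function names are made up for illustration, and it returns an empty list on systems without `/proc`.

```python
import os
import time
from typing import Dict, List, Tuple


def parse_stat(stat_line: str) -> Tuple[str, int]:
    """Return (command name, utime + stime in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' instead of on whitespace.
    head, _, tail = stat_line.rpartition(")")
    name = head.split("(", 1)[1]
    fields = tail.split()
    # After the name: fields[0] is the state, fields[11] is utime, fields[12] is stime.
    return name, int(fields[11]) + int(fields[12])


def snapshot() -> Dict[int, Tuple[str, int]]:
    """Map each running pid to (name, CPU ticks used so far)."""
    result: Dict[int, Tuple[str, int]] = {}
    if not os.path.isdir("/proc"):
        return result  # not a Linux system
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                result[int(entry)] = parse_stat(handle.read())
        except (OSError, IndexError, ValueError):
            continue  # the process exited or its stat file was unreadable
    return result


def busiest(interval_s: float = 1.0, limit: int = 5) -> List[Tuple[int, str, int]]:
    """Return up to `limit` (pid, name, ticks used) rows, busiest first."""
    before = snapshot()
    time.sleep(interval_s)
    after = snapshot()
    deltas = [
        (pid, name, ticks - before[pid][1])
        for pid, (name, ticks) in after.items()
        if pid in before
    ]
    deltas.sort(key=lambda row: row[2], reverse=True)
    return deltas[:limit]
```

Running `busiest()` from cron every minute and appending the result to a log would show what was busy just before the spike, but the log has to be shipped off the instance, because the disk of a terminated instance is normally gone with it.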

  • I cannot reproduce it; it just happens sometimes. I have 4 Ubuntu instances linked to an autoscaling group. I may be wrong, but I think Ubuntu does not auto-update. Even if it were an auto-update, it could not happen every day. – Yevgeniy Afanasyev Mar 02 '21 at 03:59