
One of my ECS Fargate tasks is stopping and restarting in what seems to be a fairly random fashion. I started the task in Dec 2019 and it has stopped/restarted three times since then. I can see that the task stopped and restarted from the service's 'Events' log (image below), but there's no information provided as to why it stopped.

[screenshot of the service's Events log]

What I've tried so far to debug this:

  1. Checked the 'Stopped' tasks inside the cluster for information on why it might have stopped. No luck here, as it appears 'Stopped' tasks are only retained for a short period of time (see the sketch after this list).
  2. Checked CloudWatch logs for any log messages that could be pertinent to this issue, nothing found
  3. Checked CloudTrail event logs for any event pertinent to this issue, nothing found
  4. Confirmed the memory and CPU utilisation is sufficient for the task; in fact, the task never reaches 30% of its limits
  5. Read multiple AWS threads about similar issues, where the solutions mainly seem to involve an ELB, which I'm not using.
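
For reference, here's a minimal boto3 sketch of step 1 done programmatically rather than through the console; the cluster name is a placeholder, and it only returns anything while ECS is still retaining the stopped task:

```python
import boto3

CLUSTER = "my-fargate-cluster"  # placeholder cluster name

ecs = boto3.client("ecs")

# ECS only keeps stopped tasks visible for a short while, so this has to
# run soon after the task is replaced.
stopped_arns = ecs.list_tasks(cluster=CLUSTER, desiredStatus="STOPPED")["taskArns"]

if stopped_arns:
    tasks = ecs.describe_tasks(cluster=CLUSTER, tasks=stopped_arns)["tasks"]
    for task in tasks:
        # stopCode/stoppedReason indicate whether ECS, Fargate maintenance,
        # or the container itself caused the stop.
        print(task["taskArn"])
        print("  stopCode:     ", task.get("stopCode"))
        print("  stoppedReason:", task.get("stoppedReason"))
        for container in task.get("containers", []):
            print("  exitCode:", container.get("exitCode"), container.get("reason", ""))
```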

Does anyone have any further debugging advice or ideas about what might be going on here?

Strokes
  • Some other thoughts: is auto scaling active? Is it related to the task that started just beforehand? Could memory have hit its maximum for an unknown reason, even though it's 'normally' within limits? – Tobin Apr 01 '20 at 16:19
  • @Tobin Appreciate the response! Auto scaling is not active. It's likely related to the task that started just before; it's the same task, but that still begs the question of why it's stopping/starting. I've looked at the memory and CPU usage of the task for the past 3 months and neither exceeds 30% of its maximum. – Strokes Apr 01 '20 at 18:07
  • 1
    Both accessing the same resource with a lock out? Uncaught exception as a result? How about local domain to r53 got changed? If a new task was launched it makes sense to terminate the old one as desired is 1. Why was the other started? – Tobin Apr 01 '20 at 19:02
  • @Tobin The thing is that both tasks are from the same task definition. I've seen similar behaviour before with other tasks, where a memory overflow causes the service to restart the task, but that isn't the case here. I'm not using Route 53, so no issues on that front. – Strokes Apr 01 '20 at 20:45
  • @Tobin For now I've set up an SNS alert, as per https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_cwet2.html, so the next time the task stops I should have more info (a rough sketch of that rule is below the comments). I'll update this question if I find a resolution. – Strokes Apr 02 '20 at 09:37
  • Yeah, good call – Tobin Apr 03 '20 at 06:21
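
For anyone wanting to replicate the SNS alert mentioned in the comments, this is roughly the rule from the linked tutorial expressed as a boto3 sketch; the SNS topic and cluster ARNs are placeholders, and the topic's access policy needs to allow events.amazonaws.com to publish to it:

```python
import json
import boto3

SNS_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:ecs-task-stopped"           # placeholder
CLUSTER_ARN = "arn:aws:ecs:eu-west-1:123456789012:cluster/my-fargate-cluster"   # placeholder

events = boto3.client("events")

# Match ECS "Task State Change" events where a task in this cluster
# has reached the STOPPED state.
pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "clusterArn": [CLUSTER_ARN],
        "lastStatus": ["STOPPED"],
    },
}

events.put_rule(
    Name="ecs-task-stopped-alert",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# The event delivered to the topic includes the task's stoppedReason,
# which should say why the task was stopped.
events.put_targets(
    Rule="ecs-task-stopped-alert",
    Targets=[{"Id": "sns-alert", "Arn": SNS_TOPIC_ARN}],
)
```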

1 Answer


I ran into the same issue and found this from AWS:

https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-maintenance.html

When AWS determines that a security or infrastructure update is needed for an Amazon ECS task hosted on AWS Fargate, the tasks need to be stopped and new tasks launched to replace them.

Also, a GitHub issue on storing stopped-task info in CloudWatch Logs (a rough sketch of that approach follows the link):

https://github.com/aws/amazon-ecs-agent/issues/368
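
A minimal sketch of that idea, assuming boto3 and a placeholder log group; note that EventBridge also needs a CloudWatch Logs resource policy allowing it to write to the log group (the console creates one automatically when you pick a log-group target), which is glossed over here:

```python
import json
import boto3

LOG_GROUP = "/aws/events/ecs-stopped-tasks"                                     # placeholder name
LOG_GROUP_ARN = "arn:aws:logs:eu-west-1:123456789012:log-group:" + LOG_GROUP    # placeholder ARN

logs = boto3.client("logs")
events = boto3.client("events")

# Create the log group that will archive the stopped-task events.
try:
    logs.create_log_group(logGroupName=LOG_GROUP)
except logs.exceptions.ResourceAlreadyExistsException:
    pass

# Match any ECS task that reaches the STOPPED state.
pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}

events.put_rule(
    Name="ecs-stopped-task-archive",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# The full event (including stoppedReason) is then kept for as long as the
# log group's retention policy allows, instead of the short window that ECS
# keeps stopped tasks visible in the console.
events.put_targets(
    Rule="ecs-stopped-task-archive",
    Targets=[{"Id": "stopped-task-log", "Arn": LOG_GROUP_ARN}],
)
```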

indybee