How to debug an ECS Fargate service that occasionally restarts tasks due to unhealthy Elastic Load Balancer health checks

Question

I'm hosting a Shiny app on ECS Fargate. It works fairly well, but occasionally it crashes while in use. I traced the crash to the following entries in the service's Events tab:

service YYYY has started 1 tasks: task XXX
service YYYY has stopped 1 running tasks: task XXX
service YYYY deregistered 1 targets in target-group (Name of Elastic Load Balancer)
service YYYY (port 3838) is unhealthy in target-group (Name of Elastic Load Balancer) due to (reason Request timed out).

Does anyone know what might be causing this? Or alternatively how can I investigate this further?

Could this be linked to spikes in CPU utilization within the application?

I've seen that at certain times CPU utilization spikes to 100%. If a user uses the application in a way that causes this high utilization, could that cause the container to be deemed unhealthy?

Also, auto-scaling is enabled for the application and set to trigger when CPU utilization exceeds 50%. However, it is not being activated in the moments when CPU utilization spikes to 100%. Any ideas?

score 1 · Answer 1 · answered Jan 08 '21 at 02:20

You can get details about stopped tasks in the ECS console: go to Cluster -> Tasks -> Stopped and open the specific task.

Additionally, in that tab you can view the container's logs if you have configured the appropriate log driver in the task definition.
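The same details can also be pulled with the AWS CLI. The following is only a minimal sketch: the cluster name `my-cluster` is a placeholder (not from the question), and it assumes the AWS CLI is installed and has credentials that allow `ecs:ListTasks` and `ecs:DescribeTasks`. Stopped tasks are only retained for a limited time, so run it soon after a restart.

```shell
#!/bin/sh
# Placeholder cluster name (an assumption); override with: CLUSTER=real-name
CLUSTER="${CLUSTER:-my-cluster}"

# Print the ARNs of the stopped tasks that ECS still remembers for the cluster.
list_stopped_tasks() {
    aws ecs list-tasks \
        --cluster "$CLUSTER" \
        --desired-status STOPPED \
        --query 'taskArns[]' \
        --output text
}

# Print the stop code, stopped reason, and per-container exit codes
# for every stopped task found by list_stopped_tasks.
describe_stopped_tasks() {
    arns=$(list_stopped_tasks) || return 1
    if [ -z "$arns" ] || [ "$arns" = "None" ]; then
        echo "no stopped tasks found"
        return 0
    fi
    # $arns is deliberately unquoted so each ARN becomes its own argument.
    aws ecs describe-tasks \
        --cluster "$CLUSTER" \
        --tasks $arns \
        --query 'tasks[].{task:taskArn,stopCode:stopCode,stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}' \
        --output json
}
```

Calling `describe_stopped_tasks` after sourcing the script prints one JSON entry per stopped task. The `stoppedReason` field should distinguish a task stopped for failing load balancer health checks from one whose container exited on its own.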

score 0 · Answer 2 · answered Jan 07 '21 at 21:19

Does the application write any logs? Make sure those logs are sent to the container's console so they show up in CloudWatch Logs for ECS.

Add the following to your Dockerfile to get logs to output to the console:

RUN ln -sf /proc/self/fd/1 /var/log/mylocation/mylogfile.log && \
    ln -sf /proc/self/fd/1 /var/log/mylocation/myerrorfile.log
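
Once the logs reach CloudWatch Logs, they can be read from a terminal around the time of a restart. This is a rough sketch assuming AWS CLI v2 (which provides `aws logs tail`) and a task definition that uses the `awslogs` log driver. The log group name `/ecs/my-shiny-app` is hypothetical, so substitute the group configured in your task definition.

```shell
#!/bin/sh
# Hypothetical log group name (an assumption); override with: LOG_GROUP=real-name
LOG_GROUP="${LOG_GROUP:-/ecs/my-shiny-app}"

# Print recent log events for the group.
# $1 is how far back to look (for example 30m or 2h); defaults to 1h.
tail_task_logs() {
    since="${1:-1h}"
    aws logs tail "$LOG_GROUP" --since "$since" --format short
}
```

For example, `tail_task_logs 30m` prints the last 30 minutes of output, which can then be compared against the timestamps of the unhealthy events in the service's Events tab.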
