High CPU load on EC2 with Nginx/Celery/Django causes server to fail

Question

I am running a Django web app on an EC2 server using Nginx and uWSGI. I also have Celery running some background tasks (no cron jobs, only tasks triggered by occasional user actions).

The app is in early closed Beta with no users currently active.

Over the past three days, the server has fallen over several times after experiencing very high CPU load, seemingly at random (see screengrab).

Before this, the app had been running without issue for weeks. I made some code changes to the website (mostly consolidating models), but none to the server configuration.

I tried to pick something up from the logs (Nginx access.log and error.log, and Django's debug.log), but I didn't see any errors or oddities (I don't have access to the logs right now).

In addition, I saw a similar effect when migrating model changes (in the venv) if I hadn't restarted the server beforehand. Sometimes, even after restarting the server, it would become so slow that I had to wait several minutes for Celery to restart.

I need help to find a starting point to investigate the problem. Any ideas?

score 0 · Accepted Answer · answered Jul 08 '20 at 15:12

After some testing and assessment, I saw that my drive space was 99% full. After cleaning it up by removing Django's debug log file as well as some other log files, the server became much more stable with no events in the last 24 hours.
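For anyone wanting to run the same check, here is a minimal sketch in Python (standard library only) that reports how full a filesystem is and lists the largest files under a directory. The function names and the `/var/log` starting point are my own choices for illustration; adjust the paths to wherever your Django and Nginx logs actually live.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, top_n=10):
    """Return the `top_n` largest regular files under `root` as (size_bytes, path) pairs."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                if os.path.islink(full_path):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    sizes.sort(reverse=True)
    return sizes[:top_n]


if __name__ == "__main__":
    # "/" and "/var/log" are assumed locations, not taken from the original setup.
    print(f"/ is {disk_usage_percent('/'):.1f}% full")
    for size, path in largest_files("/var/log"):
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

The percentage may differ slightly from what `df` reports, since `df` accounts for blocks reserved for root.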

It did push me to set up some additional monitoring with Nginx Amplify, which is a great tool for catching server issues.

I believe the CPU went into overdrive as the system struggled with the lack of free space, and cleaning up the drive solved the issue.
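Since Django's debug log was one of the files that filled the disk, one way to keep it from growing without bound is to cap its size with a rotating handler. The following is only a sketch of what that could look like in a `LOGGING` setting, not the configuration I actually used: the handler name `debug_file`, the log directory, and the size limits are all assumptions.

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical log location; in a real project this would be a path set in settings.py.
LOG_DIR = tempfile.gettempdir()

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "debug_file": {
            "level": "DEBUG",
            "class": "logging.handlers.RotatingFileHandler",
            "filename": os.path.join(LOG_DIR, "debug.log"),
            "maxBytes": 10 * 1024 * 1024,  # roll over after about 10 MB (assumed limit)
            "backupCount": 5,              # keep at most debug.log.1 ... debug.log.5
        },
    },
    "loggers": {
        "django": {
            "handlers": ["debug_file"],
            "level": "DEBUG",
            "propagate": True,
        },
    },
}

if __name__ == "__main__":
    # Django applies LOGGING itself at startup; dictConfig is called here
    # only so the sketch can be run on its own.
    logging.config.dictConfig(LOGGING)
    logging.getLogger("django").debug("rotation configured")
```

With these values the log files would be capped at roughly 60 MB in total. One caveat: Python's logging documentation notes that writing to a single file from multiple processes is not supported, so with several uWSGI workers an external tool such as logrotate may be the safer option.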
