NGINX randomly stops working, requires manual restart

Question

I have an issue I am not sure how to troubleshoot. My setup:

Amazon EC2 (t2.medium) running Ubuntu Linux 16.04 (fully up to date)
NGINX 1.10.3
8 websites running Node.js (Express), bound to ports 3000-3007 through pm2, with NGINX as the reverse proxy (`proxy_pass` in the virtual host files)
PHP 7.1 (to power a WordPress site)
The Node sites use the WordPress REST API (from the WordPress site) to serve content

The Issue:

Every few days NGINX seems to stop working. I can tell because I am unable to access the WordPress site until I run `sudo service nginx restart`. It does not seem to be a PHP issue: if I restart PHP, the WordPress site DOES NOT come back online until NGINX is restarted. The server logs in `/var/log/nginx` don't seem to give any insight, and I am unsure how to troubleshoot the issue.

Any ideas on where to start? Is there any monitoring I can set up (apart from just a basic "site down" check) that might provide insight? Maybe there is some setting in NGINX that I can toggle to handle overuse (if that is the issue)?

As you mentioned how you detect the site is down, can you verify that node applications are down too? - If node apps cannot be accessed, then we can be sure that this is a nginx problem and you may need to show us `/var/log/nginx` — mixth, May 23 '18 at 05:42
@mixth, yep the Node applications go down along with the Wordpress app. Let me dig through the `nginx` logs and see if I can find a time snippet around the time when the sites went down last. — Kirill Miniaev, May 24 '18 at 11:47

score 7 · Accepted Answer · answered Oct 24 '18 at 11:32

I encountered a similar issue when using nginx with certbot. I am hosting under Ubuntu 16.04 LTS and certbot is quite outdated (0.10.2).

As described here, this version of certbot has an issue when issuing a certificate. The standard commands don't work; specific commands must be used instead.

Certbot comes with an auto updater that renews certificates automatically. This updater does not use the workaround, and it also fails to start the nginx service again once it has finished.

What I did was disable this service. There is a file at `/etc/systemd/system/timers.target.wants/certbot.timer`. Edit this file and comment out the lines that enable the timer:

[Unit]
Description=Run certbot twice daily

[Timer]
OnCalendar=*-*-* 00,12:00:00
Persistent=true

#[Install]
#WantedBy=timers.target

Now you will have to renew the certificates manually.

score 0 · Answer 2 · answered Oct 23 '18 at 01:06

How do you specify the upstream servers for nginx?

You should note that normally, http://nginx.org/r/proxy_pass caches the resolution of the domain names at startup time, unless you're using variables within proxy_pass together with the http://nginx.org/r/resolver directive.

What this means is that the resolution of the name may become stale and incorrect, resulting in the pages no longer loading.

The solution would be to use variables within proxy_pass, as well as specifying a resolver to use for ongoing resolutions.
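
As a rough sketch of that approach, the fragment below proxies to a backend by name and has nginx re-resolve the name at run time. The hostname `backend.example.internal`, the server name, and the resolver address are placeholders that do not come from the question; only the port (3000) matches the setup described there. Point `resolver` at whatever DNS server your instance actually uses.

```nginx
server {
    listen 80;
    server_name site.example;                        # placeholder

    # DNS server nginx queries at run time (placeholder address).
    # valid=30s overrides the record's TTL, so a cached answer
    # is kept for at most 30 seconds.
    resolver 192.0.2.53 valid=30s;

    location / {
        # Because the host is held in a variable, nginx resolves it
        # at request time through the resolver above, instead of
        # resolving it once when the configuration is loaded.
        set $backend_host backend.example.internal;  # placeholder
        proxy_pass http://$backend_host:3000;
    }
}
```

If the upstreams are given as IP addresses such as 127.0.0.1, no name resolution is involved and this change would not be expected to help.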

Otherwise, the error log should still be useful to provide information on what's the cause of the downtime; make sure you're looking at the global http://nginx.org/r/error_log, not the error_log of the individual servers, which often won't have anything interesting in case of a serious issue affecting nginx as a whole.
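
To illustrate the difference between the two kinds of log, here is a minimal sketch of an `nginx.conf` with a global `error_log` in the main context and a separate one inside a `server` block. The file paths and the server name are illustrative assumptions, not taken from the question.

```nginx
# Main context (top level of nginx.conf): the global error log.
# Problems that affect nginx as a whole are written here.
error_log /var/log/nginx/error.log notice;

events {}

http {
    server {
        listen 80;
        server_name site.example;    # placeholder

        # Per-server log: used for errors raised while handling
        # requests for this server only.
        error_log /var/log/nginx/site.example.error.log;
    }
}
```

The `notice` level is more verbose than the default `error` level and should include messages about nginx starting up and shutting down, which may help show whether something stopped the process.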
