
My nginx keeps crashing and reporting "Bad Gateway" errors in the browser. Nginx and PHP-FPM don't come preconfigured to handle heavy traffic. I had to put an hourly `systemctl restart php7.0-fpm` cron job in place just to make sure my sites don't stay down for too long when they go. Let's just get down to it.

Some errors I get from /var/log/php7.0-fpm.log:

[20-Sep-2017 12:08:21] NOTICE: [pool web3] child 3495 started
[20-Sep-2017 12:08:21] NOTICE: [pool web3] child 2642 exited with code 0 after 499.814492 seconds from start

[20-Sep-2017 12:32:28] WARNING: [pool web3] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 7 idle, and 57 total children

Nothing jumps out at me in the nginx log. If I leave PHP-FPM running for too long without restarting it, I get gateway errors. I've followed tutorials three times now, tweaking settings, but it's still no good. Right now my settings are probably all way off, but it never works no matter how I set them.

/etc/nginx/nginx.conf:

user www-data;
worker_processes auto;
pid /run/nginx.pid;

worker_rlimit_nofile 100000;

events {
        worker_connections 4096;
        use epoll;
        multi_accept on;
}


http {
        sendfile on;
        reset_timedout_connection on;
        client_body_timeout 10;
        send_timeout 2;
        keepalive_timeout 30;
        keepalive_requests 100000;
        tcp_nopush on;
        tcp_nodelay on;
        types_hash_max_size 2048;
        fastcgi_read_timeout 300000;
        client_max_body_size 9000m;
        include /etc/nginx/mime.types;
        default_type application/octet-stream;
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
        ssl_prefer_server_ciphers on;
        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log;
        gzip on;
        gzip_disable "msie6";
        gzip_vary on;
        gzip_proxied any;
        gzip_comp_level 6;
        gzip_buffers 16 8k;
        gzip_http_version 1.1;
        gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
        open_file_cache max=200000 inactive=20s;
        open_file_cache_valid 30s;
        open_file_cache_min_uses 2;
        open_file_cache_errors on;

        access_log off;
}

/etc/php/7.0/fpm/php-fpm.conf:

    [www]

    pm = dynamic
    pm.max_spare_servers = 200
    pm.min_spare_servers = 100
    pm.start_servers = 100
    pm.max_children = 300

    [global]
    pid = /run/php/php7.0-fpm.pid
    error_log = /var/log/php7.0-fpm.log
    include=/etc/php/7.0/fpm/pool.d/*.conf

/etc/php/7.0/fpm/pool.d/www.conf:

[www]

user = www-data
group = www-data
listen = /run/php/php7.0-fpm.sock
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 300
pm.start_servers = 100
pm.min_spare_servers = 100
pm.max_spare_servers = 200
pm.max_requests = 500

One of my sites (/etc/php/7.0/fpm/pool.d/web3.conf):

[web3]

listen = /var/lib/php7.0-fpm/web3.sock
listen.owner = web3
listen.group = www-data
listen.mode = 0660

user = web3
group = client1

pm = dynamic
pm.max_children = 141
pm.start_servers = 20
pm.min_spare_servers = 20
pm.max_spare_servers = 35
pm.max_requests = 500

chdir = /

env[HOSTNAME] = $HOSTNAME
env[TMP] = /var/www/clients/client1/web3/tmp
env[TMPDIR] = /var/www/clients/client1/web3/tmp
env[TEMP] = /var/www/clients/client1/web3/tmp
env[PATH] = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

Resource/proc usage from htop:

(htop screenshot not shown)

xendi

6 Answers


The issue is with your database access. You have several MySQL processes using CPU, which indicates that database queries take a long time to execute.

You need to look into your application, looking for the following things:

  1. Database queries are properly optimised.
  2. Database design is efficient, and proper indexing is in place.
  3. Application has proper data caches in place.

The slow database queries then cause PHP-FPM to run out of the available child processes that handle client requests. This causes the 502 Bad Gateway errors. You can try increasing the pm.max_children setting for the web3 pool, since that is the pool producing the errors. This can relieve the scalability symptoms, but it does not fix the root cause, which is application / database inefficiency.
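As a rough sanity check before raising pm.max_children (all numbers here are placeholders, not measurements from this server), you can divide the RAM you can dedicate to PHP-FPM by the average resident size of one child:

```shell
# Hypothetical figures -- substitute your own from free(1) and ps/htop.
avail_mb=10240        # RAM left for PHP-FPM after MySQL, nginx and the OS
per_child_mb=40       # average RSS of one php-fpm child
echo $(( avail_mb / per_child_mb ))   # upper bound for pm.max_children
```

Anything above that bound risks swapping, which makes the 502s worse, not better.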

If you are not using the www pool, you can remove it to save the resources it uses.

The ideal setting for pm.max_requests is zero, that is, PHP workers are never recycled. If your PHP workers don't leak memory due to badly coded libraries, you can use zero there. Otherwise, use whatever value keeps the workers' memory usage decent. There really isn't any other good advice to give on this setting.
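In pool-config terms (illustrative fragment, assuming your workers don't leak):

```ini
; /etc/php/7.0/fpm/pool.d/web3.conf
pm.max_requests = 0   ; never recycle workers; raise only if memory creeps up
```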

There isn't much you can do with the nginx settings here, since it is PHP-FPM that is sometimes unavailable. You could change gzip_comp_level to 1, which makes nginx spend a little less CPU compressing output, but the effect is tiny compared to application optimisation.

Tero Kilkanen
  • The errors and problems I'm experiencing have nothing to do with MySQL. – xendi Sep 22 '17 at 10:02
  • Gateway timeout means that your PHP-FPM doesn't respond to the request in time. This indicates that something in your application code takes a long time to execute. Furthermore, the fact that MySQL is causing such a big load on the server indicates that it is the SQL queries that are taking such a long time so that one gets gateway timeouts. You should enable MySQL slow query log, and look what it tells you about the queries. – Tero Kilkanen Sep 22 '17 at 13:05
  • It's not MySQL. The large load on MySQL is due to some queries that yes, could be optimized but that don't cause errors. I know the exact lines of code responsible for the MySQL load and those pages load just fine. The gateway timeout errors happen instantly. There's no wait/delay and the MySQL queries being used where I'm getting errors are not causing any timeouts or extra load. Also, the gateway errors aren't confined to scripts using MySQL. Whenever this happens, all pages on all sites, no matter what scripts or content, error out with the gateway error. This happens to all pages and sites – xendi Sep 23 '17 at 13:05
  • Sorry, it's not gateway timeout. It's "Bad Gateway" – xendi Sep 23 '17 at 13:07
  • I added more text about the issue in my answer. – Tero Kilkanen Sep 23 '17 at 13:18
  • Ah. I guess that could make sense. What about also improving the settings I'm using as far as overall performance and best practices? Another answer says my `pm.max_requests` are too low. Speak to my Nginx and PHP-FPM settings as well and I'll flag the answer. – xendi Sep 23 '17 at 13:24
  • I made another update to the answer. However, Kismay's answer here regarding MySQL settings is also valid point. You should use InnoDB with separate tables as the DB engine, and make sure you allocate a good chunk of memory to it, and also optimise its settings. You should also check `iotop` on running system to see how much MySQL uses HDD. You should ask a question on MySQL settings over at dba.stackexchange.com. – Tero Kilkanen Sep 23 '17 at 16:06
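The slow query log mentioned in the comments above can be enabled with a fragment along these lines (standard MySQL variable names; the file path and 1-second threshold are illustrative):

```ini
# /etc/mysql/my.cnf -- adjust names/paths for your MySQL version
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1   # log anything slower than 1 second
```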

(This should be a comment, but it's a bit long.)

my sites keep crashing

…is not a capacity issue unless your server is so badly configured that the OOM killer is kicking in. And it is not the error you've quoted from your logs.

Why do you have half a gig of swap on a box with 12 gig of RAM?

Your keepalive is too high.

You have disabled access logging (your logs are the place to start looking for capacity issues).

The top output hints at problems with mysql performance.

Your pm.max_requests is too low.

You've not capped listen.backlog.
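A backlog cap would go in the pool file, e.g. (the value is illustrative; the effective ceiling is also bounded by the kernel's net.core.somaxconn):

```ini
; /etc/php/7.0/fpm/pool.d/web3.conf
listen.backlog = 1024
```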

Everything you've shown us here has issues, and it's just the tip of the iceberg. Voting to close.

symcbean
  • What do you mean by "it's just the tip of the iceberg"? This was a default ispconfig install and these are the only files I've modified in relation to nginx and PHP-FPM. My swap isn't causing any of this; that's the highest Linode lets you set the swap for some reason. I'll change it manually at some point soon. My top hints that I need to optimize some MySQL queries on one of my high-traffic sites (already knew that). No offense, but if you'd only posted things related to my question, it would have fit in a comment. Only the access log is off. – xendi Sep 22 '17 at 10:00
  • The fact that you would vote to close a completely valid question is questionable. I asked a relevant question with details and I'm sure that it has an answer. If more information is required, then please ask me for it. – xendi Sep 22 '17 at 10:04

Is it the web3 site that is going offline? This log entry seems to be suggesting the cause:

[20-Sep-2017 12:32:28] WARNING: [pool web3] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers)

You've got really high values for start_servers / max_spare_servers for the www site, but much lower values for web3.

You don't seem to be out of memory, so giving MySQL more memory may help. Unless your PHP app never queries MySQL, leaving MySQL out of your optimization process is a mistake.

To start, look at your MySQL config. Most distributions ship fairly conservative defaults for memory and thread counts. Look for the MySQL example configs, e.g. my-large.cnf and my-medium.cnf, and compare them to yours. Debian-based distros keep them in /usr/share/doc/mysql-server-x.y/examples/ (where x.y is the major version).

When adjusting the various knobs, I'd recommend small adjustments. For example, change a value from 8M to 16M.
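The knobs in question are along these lines (standard MySQL variable names; the values are illustrative starting points, not recommendations for this server):

```ini
# my.cnf -- adjust one value at a time and measure the effect
[mysqld]
innodb_buffer_pool_size = 2G    # biggest single lever for InnoDB workloads
tmp_table_size          = 64M
max_heap_table_size     = 64M   # keep equal to tmp_table_size
```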

If it's your php app, you'll also want to look at slow query log as suggested by Tero Kilkanen's answer.

Hope that helps.

KIsmay

In my experience, especially with a large site, PHP-FPM uses a lot of processor power. This happens when there is no cache available: the server has to render each page locally, cache it, and only then serve it from the cache. I've had the same issue with large sites before. The best thing to do is use httrack to crawl your site, with speed limits set in httrack so you don't overload the server. This builds your nginx cache; once the cache is built you will see instant page loads and very little CPU or RAM usage. The root cause usually comes down to page rendering time, which can be caused by too much JS or CSS, or more likely too many SQL requests or a poorly configured SQL database. Make sure to index database tables that are queried frequently.
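A rate-limited crawl along these lines could do the cache warm-up (the URL, output directory, and limits are placeholders; check httrack's manual for the exact flags on your version):

```shell
# Mirror the site slowly to prime the server-side cache.
# --max-rate is bytes/second; -%c caps connections per second.
httrack "https://example.com/" -O /tmp/warm-cache --max-rate=50000 -%c2
```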

SEO DEVS

htop appears to indicate that each of the 15 MySQL-associated PIDs has a TIME of more than 1:nn.nn, and each has at least 1G of VIRT RAM in use. Since you have 12 GB RAM in total, is it time for you to share with us your

SHOW GLOBAL STATUS;
SHOW GLOBAL VARIABLES;
SHOW ENGINE INNODB STATUS;

to allow some reasonable checks on your MySQL configuration, even though it is not a problem? Uptime of 1 day, 11 hours is encouraging.

Any idea what PID 6148 was doing that has a TIME of 28:+ invested in the effort?

From an earlier comment today by @xendi: "Whenever this happens, all pages on all sites, no matter what scripts or content, error out with the gateway error. This happens to all pages and sites."
Have you looked at the php.ini setting session.gc_maxlifetime = nnnn (garbage-collection seconds) as a possible cause?
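For reference, that directive lives in php.ini (the value shown is PHP's stock default, in seconds; whether it matters here is speculative):

```ini
; /etc/php/7.0/fpm/php.ini
session.gc_maxlifetime = 1440
```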

09/24/2017 nginx.conf questions that may have an impact

client_max_body_size 9000m;    # really 9G in one body?
client_body_timeout 10;   # seconds to receive the client body seems short.
open_file_cache max=200000 inactive=20s;   # may be causing churn at 20s
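If those three settings are in play, softer values along these lines could be tried (numbers illustrative, not tuned for this server):

```nginx
client_body_timeout 60;                  # give slow clients more time to send the body
client_max_body_size 64m;                # unless you really accept 9G uploads
open_file_cache max=10000 inactive=60s;  # less churn than 20s
```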

Possibly a helpful link: https://www.linode.com/docs/web-servers/nginx/configure-nginx-for-optimized-performance/

Wilson Hauck

This seems to be all about memory.

Try decreasing the number of PHP servers and limiting the memory of the PHP and MySQL servers.

pbacterio