0

Every couple of days my server suddenly crashes, and I have to request a hardware reset from the data center to get it running again.

Today I came back to my shell and found the server dead with "top" still running on it. The "top" output from right before the crash is below.

I opened /var/log/messages and scrolled to the reboot time, but there is nothing there: no errors before the hard reboot. (I checked /etc/syslog.conf and it contains "*.info;mail.none;authpriv.none;cron.none /var/log/messages". Isn't this good enough to log all problems?)

Usually when I look at top, the swap is never used up like this! I also don't know why mysqld is at 323% CPU (the server only runs Drupal and it's never slow or overloaded). Solver is my application. I don't know what 'sh' and 'dovecot' are doing.

It has been driving me crazy for the last month. Please help me solve this mystery and stop the downtime.

top - 01:10:06 up 6 days, 5 min,  3 users,  load average: 34.87, 18.68, 9.03
Tasks: 500 total,  19 running, 481 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 96.6%sy,  0.0%ni,  1.7%id,  1.8%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8165600k total,  8139764k used,    25836k free,      428k buffers
Swap:  2104496k total,  2104496k used,        0k free,     8236k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                            
 4421 mysql     15   0  571m 105m  976 S 323.5  1.3   9:08.00 mysqld                                                                                                                                                                                            
  564 root      20  -5     0    0    0 R 99.5  0.0   2:49.16 kswapd1                                                                                                                                                                                            
25767 apache    19   0  399m 8060  888 D 79.3  0.1   0:06.64 httpd                                                                                                                                                                                              
25781 apache    19   0  398m 5648  492 R 79.0  0.1   0:08.21 httpd                                                                                                                                                                                              
25961 apache    25   0  398m 5700  560 R 76.7  0.1   0:17.81 httpd                                                                                                                                                                                              
25980 apache    25   0 10816  668  520 R 75.0  0.0   0:46.95 sh                                                                                                                                                                                                 
  563 root      20  -5     0    0    0 D 71.4  0.0   3:12.37 kswapd0                                                                                                                                                                                            
25766 apache    25   0  399m 7256  756 R 69.7  0.1   0:39.83 httpd                                                                                                                                                                                              
25911 apache    25   0  398m 5612  480 R 58.8  0.1   0:17.63 httpd                                                                                                                                                                                              
25782 apache    25   0  440m  38m  648 R 55.2  0.5   0:18.94 httpd                                                                                                                                                                                              
25966 apache    25   0  398m 5640  556 R 55.2  0.1   0:48.84 httpd                                                                                                                                                                                              
 4588 root      25   0 74860  596  476 R 53.9  0.0   0:37.90 crond                                                                                                                                                                                              
25939 apache    25   0  2776  172   84 R 48.9  0.0   0:59.46 solver                                                                                                                                                                                             
 4575 root      25   0  397m 6004 1144 R 48.6  0.1   1:00.43 httpd                                                                                                                                                                                              
25962 apache    25   0  398m 5628  492 R 47.9  0.1   0:14.58 httpd                                                                                                                                                                                              
25824 apache    25   0  440m  39m  680 D 47.3  0.5   0:57.85 httpd                                                                                                                                                                                              
25968 apache    25   0  398m 5612  528 R 46.6  0.1   0:42.73 httpd                                                                                                                                                                                              
 4477 root      25   0  6084  396  280 R 46.3  0.0   0:59.53 dovecot                                                                                                                                                                                            
25982 root      25   0  397m 5108  240 R 45.9  0.1   0:18.01 httpd                                                                                                                                                                                              
25943 apache    25   0  2916  172    8 R 44.0  0.0   0:53.54 solver                                                                                                                                                                                             
30687 apache    25   0  468m  63m 1124 D 42.3  0.8   0:45.02 httpd                                                                                                                                                                                              
25978 apache    25   0  398m 5688  600 R 23.8  0.1   0:40.99 httpd                                                                                                                                                                                              
25983 root      25   0  397m 5272  384 D 14.9  0.1   0:18.99 httpd                                                                                                                                                                                              
  935 root      10  -5     0    0    0 D 14.2  0.0   1:54.60 kjournald                                                                                                                                                                                          
25986 root      25   0  397m 5308  420 D  8.9  0.1   0:04.75 httpd                                                                                                                                                                                              
 4011 haldaemo  25   0 31568 1476  716 S  5.6  0.0   0:24.36 hald                                                                                                                                                                                               
25956 apache    23   0  398m 5872  644 S  5.6  0.1   0:13.85 httpd                                                                                                                                                                                              
18336 root      18   0 13004 1332  724 R  0.3  0.0   1:46.66 top                                                                                                                                                                                                
    1 root      18   0 10372  212  180 S  0.0  0.0   0:05.99 init                                                                                                                                                                                               
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.95 migration/0                                                                                                                                                                                        
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/0                                                                                                                                                                                        
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0                                                                                                                                                                                         
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.15 migration/1                                                                                                                                                                                        
    6 root      34  19     0    0    0 S  0.0  0.0   0:00.06 ksoftirqd/1

Here is a normal top, taken while the server is working fine:

top - 01:50:41 up 21 min,  1 user,  load average: 2.98, 2.70, 1.68
Tasks: 271 total,   2 running, 269 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.0%us,  1.1%sy,  0.0%ni, 81.4%id,  2.4%wa,  0.1%hi,  0.0%si,  0.0%st
Mem:   8165600k total,  2035856k used,  6129744k free,    60840k buffers
Swap:  2104496k total,        0k used,  2104496k free,   283744k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                            
 2204 apache    17   0  466m  83m  19m S 25.9  1.0   0:22.16 httpd                                                                                                                                                                                              
11347 apache    15   0  466m  83m  19m S 25.9  1.0   0:26.10 httpd                                                                                                                                                                                              
18204 apache    18   0  481m  97m  19m D 25.2  1.2   0:13.99 httpd                                                                                                                                                                                              
 4644 apache    18   0  481m 100m  19m D 24.6  1.3   1:17.12 httpd                                                                                                                                                                                              
 4727 apache    17   0  481m  99m  19m S 24.3  1.2   1:10.77 httpd                                                                                                                                                                                              
 4777 apache    17   0  482m 102m  21m S 23.6  1.3   1:38.27 httpd                                                                                                                                                                                              
 8924 apache    15   0  483m  99m  19m S 22.3  1.3   1:13.41 httpd                                                                                                                                                                                              
 9390 apache    18   0  483m  99m  19m S 18.9  1.2   1:05.35 httpd                                                                                                                                                                                              
 4728 apache    16   0  481m 101m  19m S 14.3  1.3   1:12.50 httpd                                                                                                                                                                                              
 4648 apache    15   0  481m 107m  27m S 12.6  1.4   1:18.62 httpd                                                                                                                                                                                              
24955 apache    15   0  467m  82m  19m S  3.3  1.0   0:21.80 httpd                                                                                                                                                                                              
 4722 apache    15   0  503m 118m  19m R  1.7  1.5   1:17.79 httpd                                                                                                                                                                                              
 4647 apache    15   0  484m 105m  20m S  1.3  1.3   1:40.73 httpd                                                                                                                                                                                              
 4643 apache    16   0  481m 100m  20m S  0.7  1.3   1:11.80 httpd                                                                                                                                                                                              
 1561 root      15   0 12900 1264  828 R  0.3  0.0   0:00.54 top                                                                                                                                                                                                
 4434 mysql     15   0  496m  55m 4812 S  0.3  0.7   0:06.69 mysqld                                                                                                                                                                                             
 4646 apache    15   0  481m 100m  19m S  0.3  1.3   1:25.51 httpd                                                                                                                                                                                              
    1 root      18   0 10372  692  580 S  0.0  0.0   0:02.09 init                                                                                                                                                                                               
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.03 migration/0                                                                                                                                                                                        
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0                                                                                                                                                                                        
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0                                                                                                                                                                                         
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/1                                                                                                                                                                                        
    6 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1                                                                                                                                                                                        
    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1                                                                                                                                                                                         
    8 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/2                                                                                                                                                                                        
    9 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2                                                                                                                                                                                        
   10 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2                                                                                                                                                                                         
   11 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/3                                                                                                                                                                                        
   12 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/3                                                                                                                                                                                        
   13 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/3                                                                                                                                                                                         
   14 root      RT  -5     0    0    0 S  0.0  0.0   0:00.03 migration/4                                                                                                                                                                                        
   15 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/4                                                                                                                                                                                        
   16 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/4                                                                                                                                                                                         
   17 root      RT  -5     0    0    0 S  0.0  0.0   0:00.02 migration/5                                                                                                                                                                                        
   18 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/5                                                                                                                                                                                        
   19 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/5                                                                                                                                                                                         
   20 root      RT  -5     0    0    0 S  0.0  0.0   0:00.01 migration/6                                                                                                                                                                                        
   21 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/6                                                                                                                                                                                        
   22 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/6                                                                                                                                                                                         
   23 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/7    
Alex
  • The `top` output isn't as useful here as the kernel is already eating up all of the CPU in order to page memory to the swap partition/file. If you can get `sar` data you might be able to get information on the system health leading up to these events. Otherwise we won't be able to do much. – Jodie C Jul 03 '11 at 05:05
  • In the first top it appears that the user apache is opening a shell (sh is the Bourne shell). Why is apache opening a shell? Could you be compromised? – Nunya Jul 03 '11 at 13:53
  • Apache is probably opening a shell for his `solver` software. – Jodie C Jul 03 '11 at 20:00
  • You might start by running some consistency checks against your MySQL database(s). – user48838 Jul 03 '11 at 05:02

7 Answers

7

My guess is that your system is swapping itself to death because of waiting web requests when the database locks. You probably have one or two queries that run sporadically - possibly from a cronjob - that cause one of the database tables that is frequently used to lock. Once it does, all the queries start backing up behind it until the system starts swapping. Once that starts happening, it's the end of it.

Check your slow log and check for periodic queries that run within a few hours of when the crashes usually occur.
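Once the slow query log is enabled, pulling out the worst entries can be scripted. The following is only a sketch in Python 3: it assumes the usual slow-log layout in which each entry carries a "# Query_time: ... Lock_time: ..." header line followed by the statement, and the function name and the 5-second threshold are illustrative choices, not anything MySQL defines.

```python
import re

# Header line that precedes each statement in the slow query log.
HEADER = re.compile(r"^# Query_time:\s*([\d.]+)\s+Lock_time:\s*([\d.]+)")


def slow_queries(lines, min_seconds=5.0):
    """Yield (query_time, lock_time, sql) for entries at or above min_seconds."""
    current = None  # [query_time, lock_time, [statement lines]] of the entry being read
    for line in lines:
        match = HEADER.match(line)
        if match:
            # A new header closes the previous entry, so report it if it was slow enough.
            if current and current[0] >= min_seconds:
                yield current[0], current[1], " ".join(current[2])
            current = [float(match.group(1)), float(match.group(2)), []]
        elif current is not None and not line.startswith("#"):
            text = line.strip()
            if text:
                current[2].append(text)
    # Report the final entry, which has no following header to close it.
    if current and current[0] >= min_seconds:
        yield current[0], current[1], " ".join(current[2])
```

Pass it an open file, for example `slow_queries(open(path))` where `path` is wherever your MySQL configuration writes the slow log, and compare the surrounding timestamps in the log with the times of the crashes. Housekeeping lines such as `SET timestamp=...;` are kept as part of the statement text.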

Scrivener
  • In addition to reviewing the slow log, you may want to use mytop if it is installed. It can give you ongoing information about what queries are running and how long they are taking. I suspect that you have one or more queries that are locking tables during a long running select. – Rik Schneider Jul 03 '11 at 08:17
3

Take a look at your apache config.

You want to limit the maximum number of your apache processes to a number that fits in memory without swapping.

If you start swapping (which you have), your apache is going to run like an utter dog. At which point, any new connection will cause apache to spawn even more children (as your current children are all busy).

If the number of apache processes that fits in memory is not enough to service your requests, you need more memory or you need to optimise the application. The first place to look is your mysql queries. Check indexes. Any slow query is going to become a bottleneck around which all your apache processes will synchronise, i.e. if your slowest query takes 1 second and only 5 apache processes fit in memory, then you are not going to be able to handle more than 5 queries a second.
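The arithmetic behind that limit can be sketched in a few lines of Python 3. The total RAM and the roughly 100 MB per httpd are read off the top output in the question; the 1536 MB held back for MySQL, dovecot and the kernel is purely an assumption, and RES counts shared pages, so the real per-process cost is probably somewhat lower.

```python
# Rough capacity estimate for Apache workers; every input is an assumption to replace.

def max_workers(total_ram_mb, reserved_mb, per_worker_mb):
    """Number of workers that fit in RAM after reserving memory for everything else."""
    usable = total_ram_mb - reserved_mb
    if usable <= 0 or per_worker_mb <= 0:
        return 0
    return int(usable // per_worker_mb)


def max_requests_per_second(workers, seconds_per_request):
    """Upper bound on throughput when each worker handles one request at a time."""
    return workers / float(seconds_per_request)


if __name__ == "__main__":
    total_mb = 8165600 / 1024.0   # "Mem: 8165600k total" from the question
    reserved_mb = 1536            # assumed allowance for MySQL, dovecot, kernel, cache
    per_worker_mb = 100           # RES of a typical httpd in the normal top
    workers = max_workers(total_mb, reserved_mb, per_worker_mb)
    print("workers that fit:", workers)
    print("ceiling at 1 s per request:", max_requests_per_second(workers, 1.0), "req/s")
```

With the prefork MPM, the resulting worker count is the kind of figure you would put in MaxClients, rounded down to leave some headroom.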

Mike.

1

The top output indicates that you're running out of memory. Nothing in the top (CPU) users that you've posted is the culprit. While you could leave a top running in memory-sorted mode (press capital-M in top to switch), you're far better off collecting the data to disk for later analysis. While sar is useful in general, it's no good for per-process stuff; for that, you need pidstat (in the same package as sar -- sysstat on Debian). Unfortunately pidstat lacks some of the bulk data collection niceties of sar, but it's not hard to cobble something together that'll get the necessary data onto disk for later perusal.
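As one way of cobbling something together, here is a minimal Python 3 sketch that appends the largest memory users to a file at a fixed interval. It is a stand-in for pidstat rather than a wrapper around it: it reads Name and VmRSS straight from /proc/<pid>/status on Linux, and the function names, the interval and the log path in the usage note are all illustrative assumptions.

```python
import os
import time


def parse_status(text):
    """Extract (name, rss_kb) from the contents of a /proc/<pid>/status file."""
    name, rss_kb = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])  # kernel threads have no VmRSS line
    return name, rss_kb


def snapshot(proc_dir="/proc"):
    """Return (rss_kb, pid, name) for every process, largest resident set first."""
    rows = []
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_dir, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # the process exited between listdir() and open()
        rows.append((rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows


def log_samples(path, interval=60, count=None, top_n=15):
    """Append the top_n memory users to path every interval seconds.

    count=None keeps sampling until the process is killed.
    """
    taken = 0
    while count is None or taken < count:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as fh:
            for rss_kb, pid, name in snapshot()[:top_n]:
                fh.write("%s %d %s %d kB\n" % (stamp, pid, name, rss_kb))
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
```

Started as root with something like `log_samples("/var/log/mem-top.log")` (a hypothetical path), it leaves a trail that survives the reboot, so the last few samples before a crash show which processes were growing.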

womble
1

Your load average on regular usage is a bit high. 1.68? It's not a good number for an interactive server.

You went from 271 to 500 processes. There's nothing in top which shows where the memory is going, but I suspect that you have a lot of processes taking up a fraction of a percent of the total RAM. E.g., Apache processes.

It looks like something is creating a CPU bottleneck which is causing requests to pile up until you run out of RAM. It could be a cron job, or maybe the app itself is CPU hungry.

I would wager given the 1.68 average load on regular usage, and the high load on MySQL, there's a sub-optimal query going on.

If the app can limit the number of simultaneous users it might be a lame (but effective) way to temporarily handle the situation. Start logging with sar and turn on the slow query log.

All this said... this is what sysadmins get paid to fix. It's probably not as easy as looking at the output of top and prescribing a solution. The money you spend getting your developer to fix it would probably be better spent up-front on a sysadmin to have a close look at the system performance.

(dovecot is an imap/pop3 server. The sh on apache is suspicious, but your developer could look at that.)
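How heavy a load of 1.68 really is depends on how many CPUs share it. A minimal Python 3 sketch of that normalisation follows; the figure of eight logical CPUs is an assumption read off the normal top, which lists migration/0 through migration/7, so the machine has at least that many.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Normalise a load average by the number of logical CPUs."""
    return load_average / float(cpu_count)


if __name__ == "__main__":
    # Load figures from the question; 8 CPUs is an assumption (migration/0..7).
    print("normal:  %.2f per CPU" % load_per_cpu(1.68, 8))
    print("crashed: %.2f per CPU" % load_per_cpu(34.87, 8))
    # The same ratio for the machine this script runs on (Unix only).
    one_minute = os.getloadavg()[0]
    print("here:    %.2f per CPU" % load_per_cpu(one_minute, os.cpu_count() or 1))
```

A ratio well below 1 means runnable processes are rarely queueing for a CPU; a ratio of several per CPU, as in the crash snapshot, means they are.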

mgjk
  • Probably true (+1), however a load average is only "high" when it's more than around half the number of CPUs. Alex didn't tell us how many CPUs there are. – symcbean Apr 07 '17 at 22:23
  • Hey @symcbean, this was back in 2012. The 1.68 was his load number under ideal circumstances. He was indicating poor performance when he had a busy load number of 34.87, but by then his swap was also exhausted and kswapd was pegged. What I was trying to say was that if you're running a bursty application with a baseline of 1.68, you're probably under-specced for peak load. My guess was that under a burst the system can't service the requests fast enough, requests pile up, and then it runs out of memory. sar would give a hint as to the sequence of resource exhaustion. – mgjk Apr 08 '17 at 09:34
0

Use mytop or any other utility that shows you the list of queries currently running in MySQL, then try to find out where your application issues them. It looks like MySQL runs something heavy and the server starts to swap; at the same time Apache keeps handling old and new requests, and those processes start getting swapped out too.

Maxim
0

atop is what you need. http://www.atoptool.nl/. Available wherever fine distributions are not sold. ;)

Be sure to enable the included daemon in your init/systemd boot levels. Then, you can use

atop -r /var/log/atop/<rawfile>

to view a supercharged top interface that can move back and forth in time with 't' and 'T', and aggregate the resources used by all programs running with the same name (e.g. 'apache' or 'httpd'). Very useful for seeing what ate your RAM and swap and killed your box, among many other things.

atop truly is 'a better top'. I don't know why more people don't use it.

Jesse Adelman
0

I agree with Mike that your webserver config is currently accepting more connections than it has the capacity to handle. Limiting the number of requests is part of the solution to protect your system, but to preserve the availability of your service you need to investigate the causes of the traffic buildup (lots of log analysis) and reduce the number of resident requests. The latter is done by tuning your keepalives, caching and database optimization, which means more log analysis. There are also a lot of checks you should be running against your OS: ensuring that you are using the right mount options, I/O scheduler, IRQ balancing etc.

Check that mysql is not configured to use an excessive amount of memory using mysqltuner.pl and that memory overcommit is disabled.

Ultimately adding more hardware capacity may be necessary, and it is often cheaper to start with this than to spend time and money identifying and fixing the problem, particularly if you are operating on a very small scale.

symcbean