3

I have a web server that occasionally stops working entirely. There is no spike in web requests, CPU usage, memory usage, disk usage, or network usage leading up to the crash. All of the usage graphs simply drop to 0 at once, and the server becomes basically unreachable. I can still ping the server, and I can even open a TCP connection on ports 80 and 22, but I never get any response other than ping replies.

Rebooting the server causes a full recovery. These crashes happen roughly every 18-36 hours. This is a virtual machine running Ubuntu 11.04 (with stock PHP 5.3, Apache, JVM) on Amazon's EC2. I've created dozens of servers with the same result, so it's not a hardware issue. I've also tried rebuilding my server image from scratch with Ubuntu 10.10, and it had no effect.

What can I try to diagnose this issue?

EDIT, further details: I have a cron job running as root once per minute that logs the output of the detailed Apache status (which URLs are being run, for how long, etc). The last log before the crash looks normal, and the cron job doesn't even run once the crash happens (according to /var/log/auth.log).

EDIT, for clarity: I can telnet to port 22, but not SSH to it. I can telnet to port 80, but there is no response at all to an HTTP GET.

Ben Dilts

3 Answers

2

You have a problem with the Java application. Take 2-3 thread dumps with kill -3 <jvm_pid>. The dumps are written to the JVM's standard output, which you can locate through /proc/<jvm_pid>/fd/1. Send the thread dumps to the Java developer so they can look for stuck or locked threads.

The same thing can happen with PHP. Check the Apache status to see how many connections you have, what state each one is in, and which page each one is serving.

Edit: As an ugly workaround, you can restart the Java process instead of restarting the VM.
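
The thread-dump step above could be scripted roughly as follows. This is only a sketch: the function name take_thread_dumps and its count and interval arguments are made up for illustration. The one real command is kill -3, which sends SIGQUIT; the dumps themselves still go to the JVM's standard output, as described above.

```shell
# Hypothetical helper: send SIGQUIT (kill -3) to a JVM several times so the
# dumps can be compared for threads that stay stuck in the same place.
# Usage: take_thread_dumps <jvm_pid> [count] [interval_seconds]
take_thread_dumps() {
    td_pid="$1"
    td_count="${2:-3}"
    td_interval="${3:-5}"
    td_i=0
    while [ "$td_i" -lt "$td_count" ]; do
        # kill fails if the process is gone or we lack permission.
        kill -3 "$td_pid" || return 1
        td_i=$((td_i + 1))
        # Pause between dumps, but not after the last one.
        if [ "$td_i" -lt "$td_count" ]; then
            sleep "$td_interval"
        fi
    done
    return 0
}

# Example: three dumps, five seconds apart.
# take_thread_dumps <jvm_pid> 3 5
```

Spacing the dumps a few seconds apart is what makes them useful: a thread that shows the same stack in every dump is a likely candidate for being stuck.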

Mircea Vutcovici
  • I can't SSH into the server to do that. I suppose I could set up a cron job to dump the Java threads, and hope it keeps running during the crash. However, I already have a cron job set to dump the apache status once per minute, and it doesn't operate after the crash. Worse, the log less than 1 minute before the crash doesn't show any unusual activity in Apache. (The Java app is just a conversion process run from the shell, not Tomcat or anything long-running). – Ben Dilts Aug 09 '11 at 16:12
  • Then maybe your VM is swapping. Make sure that MaxSpareServers * RSS_of_apache is less than 90% of the free RAM. Lower MaxSpareServers. Monitor your VM for memory usage. Check the disk I/O. – Mircea Vutcovici Aug 09 '11 at 16:27
  • Add more details about the application. – Mircea Vutcovici Aug 09 '11 at 16:27
  • I don't think the VM is swapping, since disk usage drops to 0% right along with CPU usage as soon as the machine stops responding. Swapping also wouldn't explain why a 1-minute-interval cron job fails to even attempt to start, even when I leave the crashed server alone for hours. – Ben Dilts Aug 09 '11 at 16:34
  • How do you monitor the VM? Is it from inside the VM, remotely with SNMP, or from the host? When you telnet to the SSH port, do you see the SSH banner? – Mircea Vutcovici Aug 09 '11 at 16:43
  • Amazon's EC2 is monitoring the VM for some basic stats like disk IO, network IO, CPU usage, etc., from outside the virtual machine (which is running in Xen, I believe). When I attempt to SSH in, I don't get a single byte back from the server, but I do get a TCP connection opened. – Ben Dilts Aug 09 '11 at 16:55
  • If the TCP connection is in the established state but you are not receiving data back, this means that the kernel is answering and that sshd is started and listening on port 22, but sshd itself is stuck. This could happen if your VM is thrashing. Try to disable or shrink the swap. Connect to the console and try to see what is happening from there. – Mircea Vutcovici Aug 09 '11 at 16:56
  • @MirceaVutcovici let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/1035/discussion-between-ben-dilts-and-mircea-vutcovici) – Ben Dilts Aug 09 '11 at 17:01
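
The memory rule of thumb from the comments above (MaxSpareServers * RSS_of_apache under 90% of free RAM) is simple arithmetic and could be checked with a sketch like this. The function name apache_mem_ok and the numbers in the example are invented for illustration; all sizes are assumed to be in kB.

```shell
# Hypothetical helper: succeeds (exit 0) when
#   max_spare_servers * rss_per_process < 90% of free RAM
# All sizes are in kB. Integer math: compare 10*servers*rss against 9*free.
apache_mem_ok() {
    am_max_spare="$1"
    am_rss_kb="$2"
    am_free_kb="$3"
    [ $((am_max_spare * am_rss_kb * 10)) -lt $((am_free_kb * 9)) ]
}

# Example with made-up numbers: 10 spare servers at 50000 kB each
# need 500000 kB, which is under 90% of 1000000 kB free.
# apache_mem_ok 10 50000 1000000 && echo "within budget"
```

Real inputs might come from something like ps -C apache2 -o rss= for the per-process RSS and the MemFree line of /proc/meminfo for free RAM; the process name apache2 is an assumption about the Ubuntu setup.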
1

You should check out sar - hopefully it's already running and gathering lots of system stats every few minutes.

Here's some info on enabling sar on Ubuntu.

Once it's enabled, you can run sar -A to see the stats that have been collected. Hopefully there is some info in there that will point you in the right direction; for example, it should show whether your machine is suddenly using lots of virtual memory.

dmesg output can be tremendously helpful here too - maybe a weird driver issue is causing the machine to go unresponsive?
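
As a rough sketch of the enabling step: on Ubuntu the sysstat package's data collection is, to my knowledge, switched on through the ENABLED setting in /etc/default/sysstat. The helper below assumes that file layout; the name enable_sysstat is hypothetical, and it takes the path as an argument so it can be tried on a copy first.

```shell
# Hypothetical helper: set ENABLED="true" in a sysstat defaults file.
# Assumes the file has a line starting with ENABLED= (as Ubuntu's
# /etc/default/sysstat is assumed to); other lines are left untouched.
enable_sysstat() {
    es_file="$1"
    # Write to a temp file first, then copy back so the original file
    # keeps its ownership and permissions.
    sed 's/^ENABLED=.*/ENABLED="true"/' "$es_file" > "$es_file.tmp" &&
        cat "$es_file.tmp" > "$es_file" &&
        rm -f "$es_file.tmp"
}

# Example (as root), then read back what has been collected so far:
# enable_sysstat /etc/default/sysstat
# sar -A
```

Since the machine is unreachable during the hang, the useful readings are the last samples recorded before the stats stop, read back after the reboot.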

Phil Hollenback
0

Do you have nscd installed and in use? In the past, it caused weird freezes like this for me when nscd died but left its PID file around.
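
One way to spot that situation is to compare the PID file against the process table. This is a minimal sketch: pidfile_status is a hypothetical helper, and the PID file path in the example is an assumption that may differ on your system.

```shell
# Hypothetical helper: report whether the PID recorded in a pid file still
# belongs to a live process. Prints "running", "stale", or "missing".
pidfile_status() {
    pf_file="$1"
    if [ ! -r "$pf_file" ]; then
        echo "missing"
        return 0
    fi
    pf_pid=$(cat "$pf_file")
    # kill -0 sends no signal; it only checks that the PID exists and that
    # we are allowed to signal it, so run this as root for system daemons.
    if kill -0 "$pf_pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
    fi
}

# Example (path is an assumption):
# pidfile_status /var/run/nscd/nscd.pid
```

A "running" result only shows that some process holds that PID; if the number has been reused by another program, the check cannot tell the difference.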

Janne Pikkarainen