I'm a web developer with a very nasty problem on one of the websites I work on, and I'm hoping somebody here can help.

The website runs on a dedicated server with CentOS 6.6 and an Nginx + Apache configuration, with Vesta as the control panel. I'm not sure how relevant it is, but the website uses Sphinx as its search engine.

Since mid-January the server has crashed every 6 days, each time at a different hour. Recovery usually takes about an hour and 15 minutes, during which nothing is written to any of the log files.

After the server recovers, two of the largest tables in the MySQL database throw duplicate key errors. Because they are too large to repair quickly, I usually truncate them and restore them from backups.

  • I checked all the logs and I could not find any hints/relevant information about the crash. All the logs contain normal entries until the time of the crash and then resume after the server recovery.

  • I checked all the crons for all the users and there is none that runs every 6 days.

  • CPU and memory usage before the crashes is very low: 1.6% CPU and 16.5% memory, which is the usual load on the server.

  • For about a week I suspected the Vesta backup cron, because it kept memory usage at 74% at all times, even after it finished running. I disabled it, and although memory usage dropped, the crashes continue.

Do you have any advice on how I can identify the culprit? I've run out of ideas.

Thanks!

PS: If you need me to provide other information, please let me know!

Fallen
  • Have you checked `dmesg` output? – manish.in.java Mar 10 '15 at 14:58
  • What should I look for in dmesg? From what I see it's full of Firewall: *TCP_IN Blocked* and Firewall: *UDP_IN Blocked* – Fallen Mar 11 '15 at 11:19
  • [Administration panels are off topic](http://serverfault.com/help/on-topic). [Even the presence of an administration panel on a system,](http://meta.serverfault.com/q/6538/118258) because they [take over the systems in strange and non-standard ways, making it difficult or even impossible for actual system administrators to manage the servers normally](http://meta.serverfault.com/a/3924/118258), and tend to indicate low-quality questions from *users* with insufficient knowledge for this site. – HopelessN00b Mar 11 '15 at 15:51
  • I noticed you marked my question as off-topic. I thought of posting it here because I thought this is a server-related problem. I had no idea Administration panels are considered off-topic by sysadmins. Anyway, the question remains valid, and if this is not the right place, maybe you can offer me a hint about where I can post it? My second choice would've been StackOverflow, but that's a programming community, so I'm not sure it would fit either. – Fallen Mar 11 '15 at 16:44
  • `dmesg` output will tell you whether the crash was caused by the failure of a critical component, such as CPU overload or an out-of-memory condition. If it is full of blocked-connection messages, can you check that your server is not the victim of a DDoS attack? Those can also bring down servers. – manish.in.java Mar 12 '15 at 07:47
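
As a rough sketch of the `dmesg` check suggested above, the following Python filter drops the `Firewall: *TCP_IN Blocked*` / `*UDP_IN Blocked*` noise and keeps lines that look crash-related. The pattern list and the sample lines are illustrative assumptions, not an exhaustive or authoritative set. Also note that the kernel ring buffer does not survive a reboot, so after a crash the persisted kernel log (typically `/var/log/messages` on CentOS 6) is probably the more useful input.

```python
import re

# Firewall noise as quoted in the comments above.
NOISE = re.compile(r"Firewall: \*(TCP|UDP)_IN Blocked\*")

# Assumed, non-exhaustive list of phrases that often accompany a crash.
HINTS = re.compile(
    r"out of memory|oom-killer|kernel panic|machine check|"
    r"blocked for more than|i/o error|hardware error",
    re.IGNORECASE,
)

def find_crash_hints(lines):
    """Return lines matching a crash-related pattern, skipping firewall noise."""
    return [
        line.rstrip("\n")
        for line in lines
        if not NOISE.search(line) and HINTS.search(line)
    ]

if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "Firewall: *TCP_IN Blocked* IN=eth0",
        "Out of memory: Kill process 1234 (mysqld)",
        "eth0: link up",
    ]
    for hit in find_crash_hints(sample):
        print(hit)
```

To run it against real data, pass an open file instead of the sample list, for example `find_crash_hints(open("/var/log/messages"))`.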

1 Answer

Try to collect metrics and graph them. Nothing beats graphs. A tool like Munin can be very helpful in these situations: look at memory, I/O, processes, CPU, networking, interrupts, etc. over time.

http://munin-monitoring.org/
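
If installing Munin is not an option right away, a minimal sampler can record the same kind of time series. The sketch below is not Munin and is not meant to replace it; it assumes a Linux `/proc` filesystem, and the function names and CSV layout are my own. It uses `MemFree`, `Buffers` and `Cached` rather than `MemAvailable`, since older kernels such as the one in CentOS 6 do not expose the latter.

```python
import os
import time

def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a {field: value} dict (values in kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def used_percent(mem):
    """Memory in use, excluding buffers and page cache, as a percentage."""
    reclaimable = mem["MemFree"] + mem.get("Buffers", 0) + mem.get("Cached", 0)
    return 100.0 * (mem["MemTotal"] - reclaimable) / mem["MemTotal"]

def sample_line():
    """Build one CSV line: timestamp, 1-minute load, used memory percent."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    with open("/proc/loadavg") as f:
        load1, _, _ = parse_loadavg(f.read())
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return "%s,%.2f,%.1f" % (stamp, load1, used_percent(mem))

if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    print(sample_line())
```

Run from cron once a minute with the output appended to a file, the last few lines before a gap show what the machine looked like just before it went down.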

Also if your machine is a VM and has a network filesystem that becomes unavailable, that may explain the gap in the log times (for extra points try logging remotely).
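
One way to try the remote logging idea is Python's standard `logging.handlers.SysLogHandler`, which sends records to a syslog server over UDP by default. This is only a sketch: the function name and logger name are hypothetical, and it assumes a syslog daemon on another machine is configured to accept UDP messages.

```python
import logging
import logging.handlers

def make_remote_logger(host, port=514, name="crash-watch"):
    """Return a logger that sends each record to a syslog server over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # With an (address, port) tuple the handler uses a UDP socket by default.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

For example, `make_remote_logger("loghost.example.com").info("alive")` called from a cron job every minute gives a heartbeat on the remote side, where `loghost.example.com` is a placeholder. If the heartbeat keeps arriving while local logs show a gap, that points at local storage rather than a full crash.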

  • Thanks for the suggestion! So far I've been checking CPU, memory and disk usage with sar. VestaCP has some graphs too, but they don't provide much detail. – Fallen Mar 11 '15 at 11:24