
Recently, my dedicated server froze for around 50 minutes on Sunday. It did not respond to ping or any command. In the end, it was hard-rebooted by the hosting company, and everything has worked fine since then.

I've been digging into the logs for two days now, but I have failed to find anything unusual except that my logs stopped between 10:55 and 11:40.

So, maybe I'm not looking in the right place or maybe I failed to log some critical information.


Which leads me to my question: how can I find out why my dedicated server froze or crashed? What should I log, where should I look, and should I run some tests?


My server is running Debian 8.3 (Jessie), but I omitted this information because I would prefer a "generic" answer that could be useful for any Unix-like OS.
Moreover, this question may be a little too broad; I'm aware of it and I apologize if it is.

1 Answer


The situation where a server stops answering and, after a reset, its logs offer no decent explanation is pretty common. The standard approach to investigating it is to have some sort of out-of-band control over the server. Typically this is an IP KVM, usually provided by the IPMI/BMC board. HP calls it iLO, Dell calls it DRAC, IBM calls it RSA, and other vendors simply call it IPMI. It's usually handled by a separate controller, which can have a dedicated network port (it can also be accessed in shared mode, through the same network interface the OS uses, but a dedicated port is preferable). Another option is attaching an external IP KVM, which gives you the same kind of out-of-band connectivity.
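As a rough illustration of what that out-of-band access can look like from a script, here is a minimal Python sketch that only builds `ipmitool` command lines; it does not run them. It assumes the BMC speaks IPMI v2.0 over LAN and that `ipmitool` is installed on the machine you connect from. The address `192.0.2.10`, the account `admin`, and the helper name `ipmitool_argv` are placeholders of mine, not anything from your setup.

```python
import shlex


def ipmitool_argv(bmc_host, user, *subcommand):
    """Build an ipmitool command line for a BMC reachable over the network.

    -I lanplus selects the IPMI v2.0 LAN interface, and -E tells ipmitool to
    read the password from the IPMI_PASSWORD environment variable instead of
    taking it on the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E", *subcommand]


# Hypothetical BMC address and account, for illustration only.
BMC_HOST = "192.0.2.10"
BMC_USER = "admin"

COMMANDS = {
    # Hardware event log kept by the BMC itself; it survives an OS freeze.
    "event_log": ipmitool_argv(BMC_HOST, BMC_USER, "sel", "elist"),
    # Power state and last restart cause as seen by the BMC.
    "power_state": ipmitool_argv(BMC_HOST, BMC_USER, "chassis", "status"),
    # Serial-over-LAN console, if it is enabled on the board.
    "serial_console": ipmitool_argv(BMC_HOST, BMC_USER, "sol", "activate"),
}

if __name__ == "__main__":
    for name, argv in COMMANDS.items():
        print(f"{name}: {shlex.join(argv)}")
```

The event log is often the most useful of the three after the fact, since the BMC records it independently of the OS. Whether your hosting company exposes the BMC to you at all is something you would have to ask them.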

So when a server stops answering, you use this channel to log on to the server and try to understand what is wrong. If the server is still unresponsive even on the local console accessed remotely, some more complicated techniques may be attempted. The first would be entering the kernel debugger using an NMI (Non-Maskable Interrupt, which can usually be issued from the IPMI/BMC), or even forcing a fatal trap and examining the dumped kernel core after a reboot. The latter technique is OS-specific and is used only when a kernel bug is suspected. Since you are using Linux I doubt you will ever need it; however, it's worth mentioning.
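On Linux specifically, whether an NMI ends in a panic (and so possibly in a crash dump) depends on a few sysctls under `/proc/sys/kernel`. The sketch below is only a quick way to read them; the function names are my own, and a panic produces a usable dump only if something like kdump is configured, which this script does not check.

```python
from pathlib import Path

# Linux sysctls that decide whether an NMI makes the kernel panic.
NMI_SYSCTLS = ("unknown_nmi_panic", "panic_on_unrecovered_nmi", "panic_on_io_nmi")


def read_nmi_settings(kernel_dir="/proc/sys/kernel"):
    """Return {sysctl name: int value}, or None where a value cannot be read."""
    settings = {}
    for name in NMI_SYSCTLS:
        path = Path(kernel_dir) / name
        try:
            settings[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Missing file (non-Linux system, other architecture) or odd content.
            settings[name] = None
    return settings


def any_nmi_panic_enabled(settings):
    """True if at least one of the NMI panic switches is set to 1."""
    return any(value == 1 for value in settings.values())


if __name__ == "__main__":
    current = read_nmi_settings()
    for name, value in current.items():
        print(f"kernel.{name} = {value}")
    print("NMI would trigger a panic:", any_nmi_panic_enabled(current))
```

If all of these are 0, an NMI sent from the BMC will most likely just be logged as an unknown NMI and nothing else will happen.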

drookie