We recently suffered a complete server failure (no remote or even console connection).

I've found no errors, logs etc that shed light on what/why it happened.

The only odd thing about the machine is that, following the reboot, one file looks like this under ls -l:

-rw-r--r--  1 root   root      705 Feb 14 15:30 filefoo
?---------  ? ?      ?           ?            ? filname
-rw-r--r--  1 root   root      705 Feb 14 15:30 filefoo

I can't rm the file. If I touch the file, I end up with two normal-looking files (though with identical names). I can remove one, which puts the other back to looking like this (with all the question marks).

This might be a noob question, but it's impossible to Google!

Thanks, J.

Jim Morrison

2 Answers

Are you sure you don't have a hardware failure in your hard drive or your memory? Your file system appears to be corrupt. Run a memory test to confirm the RAM is sound, then run fsck on your hard drive. You can also install smartmontools to detect hard drive failures.
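
A row of question marks is typically what ls prints when it can read an entry's name from the directory but the stat call on that entry fails. As a rough illustration, here is a minimal Python sketch that lists such entries in a directory. The function name `find_unstattable` is hypothetical, and the sketch only reports the symptom; it does not diagnose or repair anything.

```python
import os

def find_unstattable(directory):
    """Return (name, reason) pairs for entries that are listed in
    `directory` but cannot be stat-ed (the rows ls shows as '?')."""
    broken = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        try:
            # lstat inspects the entry itself and does not follow symlinks.
            os.lstat(path)
        except OSError as exc:
            broken.append((name, exc.strerror))
    return broken

# Report any suspicious entries in the current directory.
for name, reason in find_unstattable("."):
    print(f"{name}: {reason}")
```

On a healthy directory this prints nothing. Any entry it does report is worth treating as a sign of the corruption described above.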

Dom
  • Thanks Dom, I don't want to duplicate-comment but as below - can I fsck on a running machine? I'll look at smartmontools right away! :) – Jim Morrison Feb 14 '12 at 20:04
  • As Bart says, it is not recommended. If it is a server, bear in mind that it can continue to corrupt your data! So seriously consider stopping it and testing the drive. If you can't stop it, create or buy a new server, restore your data from a known-good backup, and then move production over. After that you will have time to diagnose the problem. – Dom Feb 15 '12 at 15:52

You might want to check the logs for indications of filesystem or disk damage, not just what caused the crash.

Reseat memory and cables.

Then run a full fsck on your filesystems, because this looks like filesystem corruption. You'll probably want to do it from a live boot disc.

If your drives check out and the controller and other hardware seem fine, you may need to restore from backup.
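
The log check suggested at the top of this answer could be sketched in Python roughly as follows. The pattern list is an illustrative assumption rather than a complete catalogue of kernel messages, `scan_log` is a hypothetical name, and the log file location varies by distribution.

```python
import re

# Illustrative patterns (an assumption, not an exhaustive list): phrases
# that commonly accompany disk or filesystem trouble in Linux kernel logs.
SUSPECT = re.compile(
    r"I/O error|EXT[234]-fs error|medium error|remounting filesystem read-only",
    re.IGNORECASE,
)

def scan_log(lines):
    """Return the lines that match any of the suspect patterns."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]

# Usage (the path is an assumption; adjust it for your distribution):
#   with open("/var/log/syslog", errors="replace") as fh:
#       for hit in scan_log(fh):
#           print(hit)
```

An empty result does not prove the disk is healthy; it only means none of these particular phrases were logged.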

Bart Silverstrim
  • Thanks Bart. I should have said: this is a live production box that I can't really take down for any length of time, and I have no physical access to it. Presumably I need to unmount everything to run fsck? Currently looking at just moving everything to a new box - only had it for 4 months... Would you suggest that's best? – Jim Morrison Feb 14 '12 at 20:03
  • fsck on a mounted, active filesystem is very dangerous, so yes, you'd have to unmount it to have a decent chance of not losing/damaging more data. If you can't take this down to repair it for a maintenance window and you have the option to restore data to a new server then you should do that. – Bart Silverstrim Feb 14 '12 at 20:46
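
Since the advice here hinges on the filesystem being unmounted, a small pre-check can help avoid running fsck against a mounted device by mistake. This is a minimal Python sketch, assuming a Linux system that exposes /proc/mounts. The names `mounted_at` and `read_mounts` are hypothetical, and the comparison is a plain string match, so a device mounted under another name (a UUID path or a device-mapper alias, for example) would be missed.

```python
def mounted_at(device, mounts_text):
    """Return the mount points at which `device` appears in
    /proc/mounts-style text (an empty list if it is not mounted)."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device mountpoint fstype options dump pass
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points

def read_mounts(path="/proc/mounts"):
    """Read the kernel's current mount table (Linux only)."""
    with open(path) as fh:
        return fh.read()

# Usage ("/dev/sda1" is a placeholder device name):
#   if mounted_at("/dev/sda1", read_mounts()):
#       print("Still mounted; do not run a repairing fsck on it.")
```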