0

My staging server runs Mac OS X 10.6.4 (the client edition, not Mac OS X Server) and holds copies of my last build that I cannot afford to lose (human error late last night).

Bad luck always comes in tandem: rebooting the server this morning led to a blank screen with a blue-ish background (the hue one sees for a split second just before the background picture is loaded to accompany the Mac OS X login screen). After trial and error I got it to a point where I can SSH into it, and I don't really want to push my luck by using the disk unnecessarily or rebooting the system too many times. I can browse the filesystem using bash.

What are ALL the places I should look for potential impending HDD failure? Clues as to where to look for general boot-time troubleshooting are appreciated as well.

I am hesitant to start a down-rsync to recover the build (lots of data) but I would do so immediately and just rebuild the machine if I had a way to make sure the HDD is actually fine and it was just a configuration problem that impaired the usual loading of the login screen.

Thanks a lot in advance! Come save the day! :)

pauska
Maroloccio
  • Is there any chance the monitor cable is loose? Or the monitor is bad? – Dennis Williamson Jun 28 '10 at 12:50
  • Hi Dennis, thanks for the comment. It isn't anything to do with the monitor. That's ruled out. I am really interested in a list of "*.log" files or similar that I could peek into to ascertain the health of the hard disk... – Maroloccio Jun 28 '10 at 13:54

4 Answers

1

If you can copy the data you need to an external drive, I'd do that (backups of the server available?)

Otherwise, you can try booting from the install disc and running Disk Utility to check the drive and get its SMART status, or boot into target disk mode and connect another Mac to run the disk check from there.

Any repair you run risks losing data in the process. If you can copy anything you need to an external disk first, you will want to do that. Otherwise you will probably end up having to format the volume and restore.

Disk Utility can tell you the SMART status, but it's not 100% reliable as an indicator. Even reformatting isn't completely reliable if there's an iffy sector on the drive.
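
Since there is already a working SSH session, the same SMART verdict can probably be read without rebooting. This is a minimal sketch, assuming the internal disk is `disk0` (check with `diskutil list`) and that `diskutil info` prints a `SMART Status:` line, as it does on the 10.6 machines I have seen. `smart_status` is just an illustrative helper name, not a system command.

```shell
# Print only the SMART verdict from `diskutil info` output read on stdin.
# Prints "unknown" if no "SMART Status" line is present.
smart_status() {
    awk -F': *' '
        /SMART Status/ { sub(/[ \t]+$/, "", $2); print $2; found = 1 }
        END            { if (!found) print "unknown" }
    '
}

# Usage on the Mac (disk0 is an assumption):
#   diskutil info disk0 | smart_status
```

Anything other than `Verified` is a strong hint to copy first and investigate later. As noted above, `Verified` on its own does not prove the drive is healthy.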

Best protection I could recommend would be disk mirroring plus a good backup routine.

Bart Silverstrim
  • Hi Bart, thanks for answering. Copying the data will mean a lot of drive access so before doing that I would need to build a priority list of files and run things "in order", in case the drive should fail on me in the midst of copying less important files instead of critical ones. I do not have such an "importance list" ready, and it would probably take me a day to build one, so peeking first at log files to see if the hard drive is actually fine seems like the most logical start. I am here to learn though and I will gladly do things differently if there were a better way... – Maroloccio Jun 28 '10 at 14:06
  • If it's a mechanical failure, just having power going to it will kill it. If it's a bad sector, it's going to be there no matter what, just will crash or die when that sector is encountered. The data is either recoverable or it isn't, but more or less, if the drive is dying just having power going to it will kill it. On the other hand if it's just corruption then it's fine to copy your data off. Just start copying. I take it there's no backup or Time Machine? – Bart Silverstrim Jun 28 '10 at 14:40
  • No backup nor Time Machine because this was just a staging server, not meant to keep important data. It just so happens that I made a mistake the night before AND the server failed the next day. Streak of bad luck + user error. I disagree that using the drive a lot (copying various GBs of many small files) and just keeping it online are the same thing as far as the remaining life in a drive about to fail is concerned. I am pretty sure that any such heavy access stresses the hardware even more, potentially causing an unrecoverable failure. – Maroloccio Jun 28 '10 at 22:52
1

What Mac model are you using? If it has a FireWire port, you can put it into target disk mode (hold down T during startup until the FireWire icon appears on screen) and then connect it to another computer and basically use the broken system as a big external FireWire drive.

After that, you can try your luck with smartmontools or something like that to find out if the disk itself has problems, or if it is a logical problem affecting the boot process.

Anyway, I would try to rescue my data first (in read-only mode), because the situation isn't going to improve if you analyze first and the disk gets worse in the meantime.

Sven
  • Thanks Sven, I did not know about "smartmontools". So far I had limited myself to keeping an instance of SMARTReporter running in my menu bar, but I just installed smartmontools (on my laptop, not my failing disk) and I like it very much. I intend to add it to my list of MacPorts to deploy onto any new system, so as to have something so handy available should the situation occur again. As far as the current crash is concerned, I would rather not do as you advise and will try to look at things from this SSH session instead. "No rebooting if possible" was listed in my question. I appreciate your answer though! – Maroloccio Jun 28 '10 at 14:00
  • Smartmontools are used to read the SMART status of the drives from the command line. If you boot from DVD or from another Mac or from an external drive with the correct version of OS X installed, the drive utility will tell you the SMART status of the drive as well. – Bart Silverstrim Jun 28 '10 at 16:02
1

At this point, I don't think you'd want to shut it down: if it is a hard drive problem, there's a chance it might not come back up.

However, sitting a long time at the blue screen just means that something in the bootstrapping is taking a long time -- if the machine wasn't shut down cleanly, it might've decided to fsck the disks, which can take a while if you have a lot of storage attached.

I think everything sent to the console during bootup is reported in /var/log/system.log, but I'm not 100% sure.
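
If the disk is throwing read errors, the kernel usually logs them, so a quick grep is a cheap first look. This is a sketch, assuming the usual 10.6 log locations (`/var/log/kernel.log` next to `system.log`) and that the messages look like `disk0s2: I/O error`. The function name `disk_errors` is made up for illustration.

```shell
# Print log lines that mention a disk I/O error.
# Files that are missing or unreadable are silently skipped.
disk_errors() {
    for f in "$@"; do
        [ -r "$f" ] && grep -i -H 'disk[0-9][0-9s]*: I/O error' "$f"
    done
    return 0
}

# Usage on the Mac:
#   disk_errors /var/log/system.log /var/log/kernel.log
```

No output is mildly good news, not proof of health. Rotated logs are typically bzip2-compressed on 10.6, in which case `bzgrep` with the same pattern should work on them.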

When you're rebooting, you can either hold down cmd-V from the console for 'verbose mode' (it'll show the console messages, rather than just the blue screen), or you can force it to always use verbose mode:

First, check the current settings using:

nvram -p | grep boot-args

If it's not set, it's safe to do:

sudo nvram boot-args="-v"

If it's already set to something, you'll likely want to add '-v' to the current set of args.
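
Merging `-v` into whatever is already there can be scripted. This is a minimal sketch, assuming `nvram boot-args` prints the variable name and its value separated by a tab. `add_verbose` is a hypothetical helper, not part of `nvram`.

```shell
# Return the given boot-args string with -v added, unless it is already there.
add_verbose() {
    case " $1 " in
        *" -v "*) printf '%s\n' "$1" ;;      # already verbose: leave untouched
        "  ")     printf '%s\n' "-v" ;;      # nothing set: -v alone
        *)        printf '%s\n' "$1 -v" ;;   # keep existing args, append -v
    esac
}

# Usage on the Mac:
#   current=$(nvram boot-args 2>/dev/null | cut -f2-)
#   sudo nvram boot-args="$(add_verbose "$current")"
```

Printing the result with `add_verbose "$current"` before running the `sudo nvram` line is a cheap sanity check.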

Joe H.
  • Joe, I learned 2 things from your answer: 1) I did not know one could set nvram parameters from the command line - thanks for that. 2) I did not know it was possible to "always boot in CMD+V mode" (yet I knew about CMD+V) - thanks for that too. Thanks for teaching me these things, I would vote you up / accept but apart from these tips your answer does not help me in the current situation... – Maroloccio Jun 28 '10 at 22:46
0

From your description I'm inclined to think your hard drive is already screwed. I recommend you pull it out, fit it to another machine and use whatever recovery software you choose to grab as much of the file system as possible. Even if it later turns out the drive is fine you'll at least have the files before running any risky operation that might result in total loss.

John Gardeniers
  • I will do this as mounting the drive on another system is actually, as you suggested, likely to cause the least amount of additional access. Rebooting might see a fatal fsck kick in (in case I am to see the last spins of this device), and is for certain more involved than just being spun up and mounted by an external OS instance. I now have a list of files I need to copy off of it ordered by importance - i.e. I will make those last spins count and recover as much important data as I can. – Maroloccio Jun 28 '10 at 22:58
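
For the "ordered by importance" copy mentioned in the comments, here is a minimal sketch. It assumes a plain text list with one path per line, most important first. The function name, the `COPY` override and the example paths are all illustrative, not anything the posters actually used.

```shell
# Copy the paths listed in LIST (one per line, most important first) into DEST,
# in exactly that order. SRC is an optional prefix such as "user@host:" or
# "/Volumes/Staging". COPY defaults to rsync and can be overridden.
copy_by_priority() {
    list=$1 dest=$2 src=${3:-}
    while IFS= read -r path; do
        [ -n "$path" ] || continue          # skip blank lines
        # </dev/null keeps ssh (when rsync runs over it) from eating the list.
        ${COPY:-rsync -a --partial} "$src$path" "$dest/" </dev/null \
            || echo "FAILED: $path" >&2     # report and move on to the next entry
    done < "$list"
}

# Usage (hypothetical paths):
#   copy_by_priority priority.txt /Volumes/Rescue /Volumes/Staging
#   copy_by_priority priority.txt ./rescue user@staging:
```

One copy command per entry is deliberate: the order of the list is then the order of the copy, and a failure on one entry does not stop the rest. Each entry lands directly under DEST, so add rsync's `-R` if the full paths need to be recreated there.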