-2

I have an Ubuntu 13.04 server. Today I found the box had crashed. I restarted it, and now the partition table is missing on every single hard drive (one SSD for /boot and /, and three 2TB drives for RAID).

I connected the SSD to a laptop via a USB-to-SATA cable, and sure enough, the partition table is missing there too. This tells me that the motherboard, SATA controller, or software actually overwrote data on the drives, rather than the server merely failing to read them correctly.

Something similar happened to only the SSD a few months ago, and I was forced to just re-partition it.

How the heck could this have happened? Bad motherboard or SATA controller?

Mxx
Taylor
  • Depending on how you're doing the RAID, it's quite possible that the drives are not supposed to have a partition table. Can you elaborate more on the type of RAID (hardware/software, etc.)? Be sure to consider the possibility that the drives are not being assembled within the array correctly. (i.e. drives got jumbled between channels somehow) – Andrew B Jun 29 '13 at 20:01
  • What were you using for the RAID and why wasn't the SSD in a fault tolerant config? Were you using `md` or `ZFS` or a RAID controller or something like the crappy Intel Storage Manager baked into cheapo motherboards? – MDMarra Jun 29 '13 at 20:44
  • Cosmic rays ... – user9517 Jun 29 '13 at 21:39
  • What type of drive are they? If they're WD Green (or equivalent), then it's entirely plausible. – Tom O'Connor Jun 30 '13 at 06:47
  • I had this happen a few days ago. I had installed an older PCI Promise RAID controller with 4 HDs. After creating two RAID partitions and formatting them, everything looked OK (e.g. after 2 days). However, when I started to move data from the old server, I found that most of the directories were empty. Also, after a reboot, mdadm started to resynchronize the RAID partitions. Fortunately for me these were not boot and root, so I could check the messages. I found that IRQ 185 had been "removed". Needless to say, this was the Promise controller's IRQ. The funniest part was the kernel message "delayed allocation failed.... This sh – tatus2 Jun 29 '13 at 21:43
  • The issue isn't with the RAID, it's with every one of my HDs. And yes, the 2TB drives were WD Green HDs, but it also happened to my OCZ 60GB SSD. – Taylor Jul 02 '13 at 13:57

3 Answers

1

This could have been caused by bad/corrupted memory. Consider running memtest86+ to find out for sure.

Mxx
1

The disks themselves don't need to be at fault; it could be faulty memory, a faulty RAID controller, or a bug. Make sure you are on the latest firmware.

Skyhawk
Lucas Kauffman
0

There could be a lot of reasons. Data corruption doesn't need to be caused by just the things you've outlined, and doesn't imply a broken drive or that anything broke it.

The SSD's firmware could be bad. The controller (old, new, or both) could be bad. There could be a root or kernel process which was running in bad memory and overwrote the beginnings and ends of the drives. The CPU might even be bad. It's also possible that all the drives are actually bad (this doesn't happen often, but it does happen sometimes). If you are using software RAID with LVM, you might have upgraded to a buggy version or something, or just encountered a random bug.
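To see whether the beginnings and ends of the drives really were overwritten, it can help to look at the raw bytes of an image of each drive. The following is only a rough sketch in Python, not a recovery tool. It assumes 512-byte logical sectors and a raw image file; the function name `inspect_table` and the result keys are made up for illustration. It checks for the MBR boot signature in the first sector and for the `EFI PART` GPT header signature in the second and last sectors.

```python
SECTOR = 512  # assumption: 512-byte logical sectors


def inspect_table(path):
    """Report which partition-table markers survive in a raw drive image."""
    with open(path, "rb") as f:
        lba0 = f.read(SECTOR)          # MBR (or protective MBR) lives here
        lba1 = f.read(SECTOR)          # primary GPT header lives here
        f.seek(0, 2)                   # seek to the end to learn the image size
        size = f.tell()
        f.seek(max(size - SECTOR, 0))
        last = f.read(SECTOR)          # backup GPT header lives in the last sector
    return {
        "mbr_signature": lba0[510:512] == b"\x55\xaa",
        "first_sector_zeroed": len(lba0) > 0 and not any(lba0),
        "gpt_primary": lba1[:8] == b"EFI PART",
        "gpt_backup": last[:8] == b"EFI PART",
    }
```

If the first sector is entirely zero on every drive, that points toward something deliberately or accidentally writing to the start of each disk rather than toward random media failure. Keep in mind that members of some RAID setups never had a partition table in the first place.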

The best thing to do is to take a bytewise image of any drive you need to recover data from, and work on that image rather than on the original. You can write the image onto another drive, recreate a partition table exactly the same as the one you expected to see there, and try mounting. Alternatively, copy out the region where you expect the filesystem to be and mount it using a loop device, or use data recovery software of some kind. However, the easiest thing to do is restore from backup.
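If you no longer remember where the filesystem started, one way to find candidate offsets in the image is to scan for a filesystem signature. The sketch below is a hedged illustration, assuming the filesystem was ext2/3/4 (the question does not say) and that partitions begin on 512-byte boundaries. The function name `find_ext_starts` is hypothetical. It looks for the ext superblock magic `0xEF53`, which sits 1024 + 56 bytes into the filesystem, and applies a few cheap sanity checks to skip backup superblocks and most false positives.

```python
import mmap
import struct

SECTOR = 512        # assumption: partitions start on 512-byte boundaries
SB_OFFSET = 1024    # the primary ext superblock sits 1024 bytes into the filesystem
EXT_MAGIC = 0xEF53  # s_magic, little-endian, at superblock offset 56


def find_ext_starts(image_path, step=SECTOR):
    """Return byte offsets in the image where an ext filesystem may begin."""
    hits = []
    with open(image_path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as img:
        # Stop early enough that a full 1024-byte superblock still fits.
        for start in range(0, len(img) - SB_OFFSET - 1024 + 1, step):
            sb = start + SB_OFFSET
            magic, = struct.unpack_from("<H", img, sb + 56)
            if magic != EXT_MAGIC:
                continue
            blocks, = struct.unpack_from("<I", img, sb + 4)   # s_blocks_count_lo
            log_bs, = struct.unpack_from("<I", img, sb + 24)  # s_log_block_size
            group, = struct.unpack_from("<H", img, sb + 90)   # s_block_group_nr
            # Primary superblocks report group 0; backup copies report their own group.
            if blocks > 0 and log_bs <= 6 and group == 0:
                hits.append(start)
    return hits
```

An offset returned here is only a candidate. It could be tried read-only with a loop device, for example via the `offset=` mount option or `losetup --offset`. This applies most directly to a standalone disk such as the SSD; a single member of a striped RAID array will generally not contain a mountable filesystem on its own.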

It isn't immediately clear what kind of hardware you are using. I would nonetheless run a full hardware test on the server (at least memtest, but do HDD tests and a CPU test if you have a capable test suite). Test the drives on the controller and on another controller if you can, and check their SMART status. Update everything related to the drives (particularly the filesystem drivers, the kernel, and LVM if it is in use). If you have a hardware RAID device, consider upgrading its firmware.

I have had this issue caused by several faulty RAID controllers in the past too. If it has blanked parts of multiple drives, get an RMA for it and put in a new one.

Falcon Momot