42

So, let's say your server has 6 healthy hard drives. A drive fails (won't mount or detect, drops out of the RAID with errors) or is failing (SMART data getting worse, etc.). You need to swap out the bad drive. When you open the case you see... six identical hard drives.

How can you tell which one is no longer healthy/mounting/functioning?

The system would be Linux, most likely Ubuntu Server, using at most simple software RAID. The hard drives would be SATA and connected directly to the motherboard (no RAID controller).

I don't want to randomly disconnect drives until I pick the correct one. The drives all appear identical to me; I imagine there is some common way to identify which drive is which that I am unaware of. Does anyone have any pointers/tips/best practices? Thanks!

EDIT: I had wanted this to be 'generalized' in a hand-wavy sort of way, but it just came off as 'incomplete' and 'horrible'. My bad!

masegaloeh
privatehuff
  • If you have to shut down the machine and figure out which hard drive is which, take the time while the machine is down to identify each hard drive and label it in some manner, so that when this happens again you don't have this issue. – Roy Rico Sep 10 '09 at 18:53
  • A "RAID (or whatever)"? Sounds like a user's loose inside the machine room. – romandas Sep 10 '09 at 18:58
  • I meant only to glibly imply that the specifics were not important beyond the need to identify which drive needs to be replaced :) – privatehuff Sep 10 '09 at 19:58
  • A proper server will tell you which drive by turning on the drive error indicator of the bad drive. – John Gardeniers Sep 10 '09 at 23:47
  • Man, everyone is so quick to jump on this as being naive... frankly I think it's a good question, one that I've had to deal with myself! – Mark Henderson Sep 11 '09 at 12:36
  • In fairness to them, it was much worse before the edit. – privatehuff Sep 11 '09 at 14:01
  • I'm curious whether, for hobby purposes, it is possible to somehow construct (soldering iron in hand, and so on) drive-signalling LEDs to identify drives physically from within a random OS (when there's no decent server-grade disk/RAID controller present to do that magic)... – Oskar Duveborn Oct 26 '10 at 18:03
  • Which OS runs on that server? – igustin Jan 28 '11 at 04:01
  • dd if=/dev/sda of=/dev/null produces activity on the sda drive, and you can see that. –  Apr 09 '13 at 12:03
  • Could you add a little more detail please? – slm Apr 09 '13 at 12:23
  • If the drive is functionally failed then this will produce activity, sure, which may flash an activity light depending on the chassis. Depending on the failure mode even this may not be possible. – Scott Pack Apr 09 '13 at 13:19
  • What kind of drives are they (SATA/SCSI/SAS)? What storage controller are you using? Which operating system? Is it hardware or software RAID? *Who* is telling you a drive is faulty? You really should provide some more info here... – Massimo Sep 10 '09 at 18:06

13 Answers

33

I had this exact problem on a (tower) server just like you explain, and it was easy:

smartctl will output the serial number of the drive

Other tools, like hdparm, and vendors' own specific utilities will often do the same.

So output the serial of the bad drive, and then use a dentist's mirror and a flashlight to find the drive.
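
For example, a minimal check (assuming smartmontools is installed; /dev/sdb here is just a placeholder for your suspect drive):

sudo smartctl -i /dev/sdb | grep -i serial

The Serial Number line should match the sticker on the drive's label.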

On a rackmount you'll usually have indicator lights like other people have said, but I bet the same would apply.

Tom Ritter
  • Whoops... smartctl, not hdparm, was the one I was thinking of. I need to edit my answer to reflect that. – Bart Silverstrim Sep 10 '09 at 21:15
  • upvoted for reminding me of the right command :-) – Bart Silverstrim Sep 10 '09 at 21:15
  • hdparm -i shows me the serial numbers of my drives -- that may be a vendor-specific response, though – Ian Clelland Sep 10 '09 at 22:02
  • Excellent! I can't try it now but it looks like this is the answer! I will now label my hard drives with the last N digits of their serial numbers (assuming this is unique per server) in a place that is exposed while mounted. Also, from googling, the command looks to be "smartctl -i". – privatehuff Sep 11 '09 at 14:00
  • Note that this requires writing at least the end of the serial number on each drive or bay, in a way that's visible while the system is in active use and everything is still okay, preferably at build time. – Mikko Rantalainen Sep 16 '22 at 09:13
27

Putting stickers on drives (depending on the design of the tray) may not be feasible. By the time the drive dies, the stickers could have dried up and fallen off.

ledctl (from package ledmon) is really the way to go with this.

ledctl locate=/dev/disk/by-id/[drive-id]

or

ledctl locate=/dev/sda

will illuminate the drive fail light on your chassis for the specified drive. I provided two examples to illustrate that it doesn't matter HOW you identify the drive. You can use serial, name, etc... Whatever information is available to you can be used. The drives are referenced multiple ways under the /dev/ and /dev/disk/ path.
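
If you're unsure which /dev/disk/by-id entry maps to which device node, listing the symlinks shows the mapping directly:

ls -l /dev/disk/by-id/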

To turn the light back off, just execute it again, changing locate to locate_off like so:

ledctl locate_off=/dev/sda
UCS75
  • `… verified to work with Intel(R) storage controllers (i.e. the Intel(R) AHCI controller) and have not been tested with storage controllers of other vendors (especially SAS/SCSI controllers).` I had no luck with AMD. – Serge Stroobandt Aug 20 '22 at 22:49
  • This only works if your hardware supports this in the kernel. If the hardware is supported, you will have a file called `locate` within the `/sys` hierarchy for each supported device. Try `sudo find /sys -name "locate"` to find supported devices. – Mikko Rantalainen Sep 16 '22 at 09:08
8

If you have no locate light and can't easily find the serial numbers on the outside of the drives, sometimes this cheesy technique can help: create a LOT of activity on that specific drive and then look for the drive with the activity LED on solid. It's best to follow up with a more detailed check of the serial number, but this can help narrow the search.

E.g.:

# while true; do dd if=/dev/disk/by-id/scsi-drive-that-is-dying of=/dev/null; sleep 1; done

(The while loop is not technically needed, but it will keep things moving while you head to the data center. The "sleep 1" helps avoid the high CPU usage created by a fast loop if the "dd" fails due to say... the drive being disconnected.)

Steve Bonds
  • This is pure genius! It worked amazingly for me with an 8-HDD hot-swap case with basic "power" and "operation" LEDs. Thank you! – user3161330 Mar 11 '20 at 17:59
  • If the drive is completely dead, you can also do the opposite: look for the drive with *no* activity. A scrub/integrity-check can be a good way to generate activity. – SomeoneSomewhereSupportsMonica Dec 01 '21 at 10:08
  • I could not resist writing a fail-proof [bash script](https://serverfault.com/a/1108701/175321) based on this excellent answer. – Serge Stroobandt Aug 20 '22 at 23:05
  • One could also do something like Morse code: read a 1 MB block for a dot and a 10 MB block for a dash, and use `sleep 0.5` and `sleep 1.5` for pauses. – Mikko Rantalainen Sep 16 '22 at 09:11
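
A rough sketch of that Morse idea, assuming your LED reflects read activity (the device path and block sizes are illustrative; iflag=direct bypasses the page cache so repeated reads still hit the disk):

dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct    # dot
sleep 0.5
dd if=/dev/sda of=/dev/null bs=1M count=10 iflag=direct   # dash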
6

Usually you would have to hope that the connections are labeled in some fashion, then work from the identity of the failed device. For example (and someone should correct me if this is wrong): if you have two IDE channels with up to 2 drives on each, you could have sda, sdb, sdc, and sdd. If sdd failed, it would be the second drive on the cable of the second IDE channel.

If it's SATA, and like the system I have in the back room, the ports are labeled for each of the SATA drives. Again, drive lettering goes from a through however many drives you have, starting at port 0 of the SATA connectors and moving up.

If there are any manufacturing differences, dmesg | grep sd or dmesg | grep hd should yield some clues.
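
For instance, this sketch (exact message formats vary by kernel version) lists what the kernel logged about each sd device, including model strings that can differ between drives:

dmesg | grep -i 'sd[a-z]'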

If you have the serial numbers available, I think the hdparm command might give them to you in software so you can trace a drive that way. You might want to label the drives somewhere if that's the case, so you don't have to worry about it when you find there's an issue.

...I knew there was another reason I preferred hardware RAID over software RAID...blinky lights. Really like the blinky lights.

EDIT: smartctl, not hdparm, gives the serial number. My bad.

Bart Silverstrim
5

Some drives expose a locate "file" in /sys into which you can echo a 1 to turn the locate indicator light on, or a 0 to turn it off.

# for light in $( find /sys -name "locate" ) ; do echo 1 > $light ; sleep 10 ; echo 0 > $light ; done
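
These locate files are typically provided by SES-capable enclosures under /sys/class/enclosure. A sketch for lighting just one slot (the slot path below is illustrative — yours will differ, so check with find /sys -name locate first):

echo 1 > /sys/class/enclosure/0:0:8:0/Slot01/locate    # light on
echo 0 > /sys/class/enclosure/0:0:8:0/Slot01/locate    # light off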
crh
4

Short answer: "lsscsi". Detailed answer: "lshw -c disk" will show you the HDDs and the SATA ports to which they are connected.
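
A quick sketch of both, assuming the lsscsi and lshw packages are installed:

lsscsi              # one line per disk: [host:channel:target:lun], vendor, model, device node
sudo lshw -c disk   # detailed per-disk report, including serial number and bus info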

2

I could not resist writing a bash script based on Steve Bonds' answer.

Unlike ledctl, it also works fine with non-Intel hard drive controllers.

#!/usr/bin/env bash
# https://serverfault.com/a/1108701/175321
# Loop forever, reading the given device so its activity LED stays lit.

if [[ $# -gt 0 ]]
then
    while true
    do
        # Read the whole device; fall back to sudo if permission is denied
        dd if="$1" of=/dev/null >/dev/null 2>&1 || sudo dd if="$1" of=/dev/null >/dev/null 2>&1
        # Brief pause so a dead or disconnected drive doesn't spin the CPU
        sleep 1
    done
else
    echo -e '\nThis command requires a /dev argument.\n'
fi
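
Save it as, say, blink-drive.sh (the name is arbitrary), make it executable, and pass it the device to light up; Ctrl-C stops the loop:

chmod +x blink-drive.sh
./blink-drive.sh /dev/sda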
Chopper3
Serge Stroobandt
2

Six internal HDDs? If they are external, hot-swap drives, the hot-swap carrier likely has an error light to help you identify the bad drive. Also, many RAID management programs have an option to flash the light on a particular drive to determine which is which. If they are all internal with no lights, then you are down to your RAID software telling you which IDs are good, and looking at the SCSI IDs, etc., to figure it out. If they are set to auto, then your RAID controller documentation should tell you in what order the IDs are assigned along the SCSI chain. Good luck. Take a backup now while things are still running!

BillN
2

At the very least, the RAID software/controller which told you about the failed drive should tell you which drive has failed (its ID number). Drive 0 is usually the one at the top left, moving down, then to the right (if in two or more columns). The ports are probably labeled.

mrdenny
1

mdadm -h

sginfo -s /dev/sdX prints just the serial number.

There are several sg* commands from the collection of generic SCSI commands; the sg device nodes they use are sometimes only created by udev.

I have a hardware RAID controller that gives different answers for the location of the drive depending on whether you ask the RAID card firmware, the BIOS, IPMI, or the mptctl utility, so I had to cross-reference by serial numbers.
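
A sketch for that cross-referencing: dump every drive's serial in one pass (assumes smartmontools is installed and your disks show up as /dev/sda, /dev/sdb, and so on):

for d in /dev/sd?; do
    printf '%s: ' "$d"
    sudo smartctl -i "$d" | grep -i serial
done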

rjt
1

When all else fails, you can identify the not-failed drives and work backwards.

find / -type f -exec cat {} + > /dev/null

Whichever drive's activity light does NOT come on is likely the bad one (and hopefully it's just one). Note that if you have hot spares configured, those won't light up either.

toppledwagon
1

scsirastools has a set of tools that let you run various diagnostic tests on SCSI disks. You can also use sgmon to power down a disk under software control. That would at least let you identify the physical disk if you can't locate it with the diagnostics.

If you have a hardware RAID controller the controller's BIOS or management software should have a facility that lets you identify bad disks.

0

They should be labeled on the chassis and correspond with the RAID software.

On our Dells, they are not arranged the way you would think: 0:0 is bottom left, 0:1 is top left, 0:2 is bottom middle, etc. In all servers I've used (except homemade jobs), the RAID software will indicate the port, and it will be labeled.

dubRun