I've got a 9-disk RAID 5 array.

Today I got a mail from my server:

This is an automatically generated mail message from mdadm
running on Eldorado

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdi1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid5 sdb1[1] sdi1[9](F) sdd1[5] sdh1[3] sdj1[7] sde1[4] sdg1[6] sdf1[0] sdc1[2]
  7801484288 blocks level 5, 64k chunk, algorithm 2 [9/8] [UUUUUUUU_]

unused devices: <none>

This looks like /dev/sdi has a problem.

However, I ran

smartctl -t long -d 3ware,7 /dev/twa0

(the drives are on a 3ware controller; I also ran the short and conveyance tests before), and in any case smartctl does not report a severe problem:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7
  3 Spin_Up_Time            0x0027   228   109   021    Pre-fail  Always       -       1591
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       609
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15445
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       607
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       606
193 Load_Cycle_Count        0x0032   134   134   000    Old_age   Always       -       199738
194 Temperature_Celsius     0x0022   113   106   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%     15434         -
# 2  Short offline       Completed without error       00%     15434         -

So at the moment, I'm not sure what is causing the fault and whether I can just re-add the drive or need to replace it.
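
For completeness, these are the other places one could look for the reason md failed the disk. Device names are the ones from above; `tw_cli` is 3ware's own CLI tool, which may not be installed, and `/c0` is assumed to be the controller's ID:

```shell
# The kernel log usually records why md kicked the disk out
# (I/O timeouts, bus resets, read errors):
dmesg | grep -i -E 'sdi|md0'

# Array-level view of the failure (state of each member):
mdadm --detail /dev/md0

# The 3ware controller's own view of the unit and drive states:
tw_cli /c0 show
```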

I'm on Ubuntu 12.04 Server, mdadm v3.2.5.

Any clues?

I'm aware of the thread Ubuntu 12.04 Server Software RAID1 - Faulty Spare - Smart Output Passed - Confused, which seems to describe the same problem, but that thread has not been answered yet.

Best regards, Stephan

1 Answer

Assuming you're using consumer-grade drives, the most likely cause is that the drive took too long to respond to a request and the controller card assumed the drive had failed.

Consumer-grade drive firmware spends longer trying to recover data from hard-to-read sectors than server-grade firmware does. This makes the drives more reliable in single-disk operation, but in a RAID array it can cause them to be marked as "failed" when there's nothing actually wrong with them.
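
You can check whether the drive supports a bounded error-recovery timeout (SCT Error Recovery Control, the feature behind TLER) with smartctl. This is a sketch using the controller addressing from the question; not all consumer drives support SCT ERC, and the port number may differ on your system:

```shell
# Query the drive's current SCT ERC read/write recovery timeouts.
# "-d 3ware,7" addresses port 7 behind the 3ware controller, as in the question.
smartctl -l scterc -d 3ware,7 /dev/twa0

# If supported, cap recovery at 7 seconds (values are in tenths of a second)
# so the drive gives up on a bad sector before the controller gives up on it:
smartctl -l scterc,70,70 -d 3ware,7 /dev/twa0
```

Note that this setting typically resets on power cycle, so drives that support it are usually reconfigured at boot.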

Odds are there's nothing wrong with your drive. If you're feeling paranoid, you can run a surface scan for bad blocks (read-only or read-write), but I'd just put it back into the array.
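
A sketch of the surface-scan and re-add steps, assuming the same device names as in the question (run the scan read-only while the disk is out of the array; a read-write `badblocks` scan would destroy data):

```shell
# Read-only surface scan of the suspect drive (safe, non-destructive):
badblocks -sv /dev/sdi

# If the scan is clean, drop the failed member and add it back;
# mdadm will resync it from the remaining eight disks:
mdadm /dev/md0 --remove /dev/sdi1
mdadm /dev/md0 --add /dev/sdi1

# Watch the rebuild progress:
cat /proc/mdstat
```

Expect the resync of an array this size to take several hours, during which a second failure would be fatal to the array.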

Mark