I've got a 9-disk RAID 5 array.
Today I got a mail from my server:
This is an automatically generated mail message from mdadm
running on Eldorado
A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/sdi1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[1] sdi1[9](F) sdd1[5] sdh1[3] sdj1[7] sde1[4] sdg1[6] sdf1[0] sdc1[2]
7801484288 blocks level 5, 64k chunk, algorithm 2 [9/8] [UUUUUUUU_]
unused devices: <none>
This looks like /dev/sdi has a problem.
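For reference, the part of the mdstat output that points at sdi1 is the (F) suffix on the member, together with the [9/8] count and the trailing underscore in [UUUUUUUU_]. A minimal Python sketch of pulling that out of mdstat-style text could look like the following. The function name `failed_members` and the sample variable are made up for illustration, and the parsing assumes the member format shown in the mail above.

```python
import re

def failed_members(mdstat_text):
    """Map each md array name to the list of members marked (F)."""
    result = {}
    for line in mdstat_text.splitlines():
        # Array lines look like: "md0 : active raid5 sdb1[1] sdi1[9](F) ..."
        m = re.match(r"^(md\d+)\s*:\s*(.*)$", line.strip())
        if not m:
            continue
        array, rest = m.groups()
        # A member is "name[slot]"; a trailing "(F)" marks it as faulty.
        result[array] = re.findall(r"(\w+)\[\d+\]\(F\)", rest)
    return result

# Sample text copied from the mail quoted above.
mdstat_sample = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[1] sdi1[9](F) sdd1[5] sdh1[3] sdj1[7] sde1[4] sdg1[6] sdf1[0] sdc1[2]
7801484288 blocks level 5, 64k chunk, algorithm 2 [9/8] [UUUUUUUU_]
unused devices: <none>
"""

print(failed_members(mdstat_sample))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the pasted sample.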
However, I ran
smartctl -t long -d 3ware,7 /dev/twa0
(the drives are on a 3ware controller; I also ran the short and conveyance tests before), and in any case smartctl does not report a severe problem:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 7
3 Spin_Up_Time 0x0027 228 109 021 Pre-fail Always - 1591
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 609
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 15445
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 607
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 606
193 Load_Cycle_Count 0x0032 134 134 000 Old_age Always - 199738
194 Temperature_Celsius 0x0022 113 106 000 Old_age Always - 34
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Conveyance offline Completed without error 00% 15434 -
# 2 Short offline Completed without error 00% 15434 -
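As a rough cross-check of the table above, here is a small Python sketch that parses smartctl attribute rows and flags the ones that usually matter for a failing disk. The names `parse_smart_attributes`, `suspicious` and `WATCHED_IDS` are hypothetical, and the choice of attributes 5, 197 and 198 is an assumption based on a common rule of thumb, not something smartctl or mdadm prescribes.

```python
def parse_smart_attributes(text):
    """Parse smartctl attribute rows into {id: {...}} (10 whitespace-separated columns)."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID; header and log lines are skipped.
        if len(parts) >= 10 and parts[0].isdigit():
            rows[int(parts[0])] = {
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": parts[9],
            }
    return rows

# Assumption: reallocated, pending and offline-uncorrectable sectors are the
# attributes most often watched for media failure.
WATCHED_IDS = (5, 197, 198)

def suspicious(rows):
    """Return names of attributes at or below threshold, or watched with nonzero raw value."""
    flagged = []
    for attr_id, row in rows.items():
        below_thresh = row["thresh"] > 0 and row["value"] <= row["thresh"]
        nonzero_raw = attr_id in WATCHED_IDS and row["raw"] != "0"
        if below_thresh or nonzero_raw:
            flagged.append(row["name"])
    return flagged

# A few rows copied from the smartctl output above.
smart_sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 7
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
193 Load_Cycle_Count 0x0032 134 134 000 Old_age Always - 199738
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

smart_rows = parse_smart_attributes(smart_sample)
print(suspicious(smart_rows))
```

An empty result only means that none of these particular checks fired for the rows given; it does not rule out cabling, controller or timeout problems.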
So at the moment I'm not sure what is causing the fault, or whether I can just re-add the drive or need to replace it.
I'm on Ubuntu 12.04 Server, mdadm v3.2.5.
Any clues?
I'm aware of the thread "Ubuntu 12.04 Server Software RAID1 - Faulty Spare - Smart Output Passed - Confused", which seems to mirror the problem, but that thread has not been answered yet.
Best regards, Stephan