
I'm wondering whether the results of this SMART self-test indicate a failing drive; this is the only drive in the system whose results show 'Completed: read failure'.

# smartctl -l selftest /dev/sde
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)   LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      8981         976642822
# 2  Extended offline    Aborted by host               90%      8981         -
# 3  Extended offline    Completed: read failure       90%      8981         976642822
# 4  Extended offline    Interrupted (host reset)      90%      8977         -
# 5  Extended offline    Completed without error       00%       410         -

The drive doesn't yet show any other signs of failure beyond that SMART self-test output. For comparison, this is the output from a different drive in the same system, which is currently running a SMART self-test:

# smartctl -l selftest /dev/sdc
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 30%     15859         -
# 2  Extended offline    Completed without error       00%      9431         -
# 3  Extended offline    Completed without error       00%      8368         -


Here are the SMART attributes for the suspect drive, /dev/sde:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   176   175   021    Pre-fail  Always       -       4183
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       8982
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       46
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13
194 Temperature_Celsius     0x0022   111   101   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2
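
For reference, output like the above comes from smartctl invocations along these lines (device name as in the question; the flags are standard smartmontools options):

```
# Start a long (extended) offline self-test; it runs in the drive's firmware
smartctl -t long /dev/sde

# Check progress and the log of past self-tests
smartctl -l selftest /dev/sde

# Dump the vendor-specific SMART attributes (Current_Pending_Sector, etc.)
smartctl -A /dev/sde
```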
Jeff Welling
  • The data you added looks good enough. If the drive is part of a RAID array I wouldn't worry about it. You should be backing up your important files in the first place; now is a good time to start if you aren't. – Chris S Jul 31 '11 at 15:46
  • The backup is *on* the RAID array, which the drive is a part of. The originals are still fine, and the RAID array won't be taken down by anything less than 2 simultaneous drive failures, so I think I'm alright. – Jeff Welling Jul 31 '11 at 15:58
  • @Jeff Welling: Not to be a pedant about it, but if your "backup" is on the RAID array, it's not a "backup", it's a "copy". Personally, I'd replace the drive at the first sign of failure; for what little even a good drive costs these days, the insurance is well worth it. Also, I just experienced two drive failures, in the same (RAID10) array, on the same day, out of the 6 drives in it. FWIW. – Kendall Jul 31 '11 at 21:09
  • @Kendall, I think he means the array is used for backups and the originals are elsewhere. If that's the case I'd chance it, as it's somewhat unlikely two drives will fail (unless they are new drives; infant mortality is a common problem, and burn-in is a common practice in large arrays). – Chris S Jul 31 '11 at 21:49
  • If this was a business setup, I'd have several more backups (including offsite), but this is just a personal backup of fairly unimportant items in comparison. Also, due to the homegrown nature of the RAID array, all of the drives are from different places and batches so the chance of simultaneous failure is/should be (knock on wood) low. – Jeff Welling Aug 01 '11 at 00:32
  • The second failure is often triggered by a RAID rebuild, especially in systems where the drives are not under heavy load. – Oleksandr Pryimak Jul 28 '13 at 23:39

5 Answers


Hopefully you've long since replaced the drive, but since no one has yet directly answered the question...

You ran two extended tests, and both failed to read the same logical sector of the disk, as indicated by `Completed: read failure` with the same LBA in both runs. This does indeed indicate that the disk has a defect, and you should be able to have it replaced under warranty. Attempting to store data in that sector may or may not cause the drive to notice the defect during the write and remap the sector; but if the drive doesn't notice, and then can't read the data later on, you've lost it.
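
As a rough illustration (not from the answer itself; the LBA and device name are simply the ones from the question), this is how one might confirm the bad sector and watch the counters that track remapping:

```
# Try to read the LBA reported by the self-test; a genuinely bad sector
# typically returns an I/O error here
hdparm --read-sector 976642822 /dev/sde

# A non-zero Current_Pending_Sector means the drive is waiting for a
# write to that sector before deciding whether to remap it
smartctl -A /dev/sde | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
```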

Michael Hampton

Is your data worth risking on a suspect drive?

If it were me, I'd replace the drive and be thankful that SMART saved me a big headache.

Bacon Bits
  • In addition, I'd at the very least set up a cron script to run SMART tests on your drives once a week and email you the output, so that in most cases you can spot ahead of time which drives might be on their last legs, rather than having to recover from a failure and restore from backups. Easier yet, if you have multiple machines, is using a monitoring tool like Nagios or Munin. – Wilshire Jul 31 '11 at 15:45
  • That's easier to do when you know what SMART output indicates a failing drive; it's hard to tell what does and does not. – Jeff Welling Jul 31 '11 at 15:54
  • You don't need a cron script; there is a [smartd daemon](http://smartmontools.sourceforge.net/man/smartd.8.html) in the smartmontools package that handles exactly what you want: regular checking of SMART status. All you need is to [create a configuration](http://smartmontools.sourceforge.net/man/smartd.conf.5.html) and start the service (see the sketch after these comments). The smartmontools package also contains some sample scripts that smartd can call when something starts failing. – Sgaduuw Jul 31 '11 at 18:32
  • I'm not using a cron script, I'm using the smartd daemon. It spits out notes in the system log; I noticed some lines I don't normally see for any other drive and ran a self-test, which had failed when I checked it. I'd never seen this kind of failure before, so I thought people on here might have. The syslog output of smartd is pretty cryptic if you don't have a ton of experience with it; it doesn't exactly tell you "Drive X is dying and needs to be replaced", though it would be nice if it did :) – Jeff Welling Aug 06 '11 at 04:06
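
To make the smartd suggestion above concrete, here is a minimal /etc/smartd.conf sketch; the schedule, test types, and mail address are illustrative assumptions, not the poster's actual configuration:

```
# Monitor all SMART attributes on this drive (-a), run a short self-test
# every day at 02:00 and a long self-test every Saturday at 03:00 (-s),
# and mail root when something starts to fail (-m).
/dev/sde -a -s (S/../.././02|L/../../6/03) -m root
```

Restarting the smartd service after editing the file picks up the new schedule.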

What would I do in your situation?

First of all, I would find out which files are affected; there are instructions on how to do this at https://www.smartmontools.org/wiki/BadBlockHowto. In your case it is harder because you have an array, but it is possible. Then ensure that the affected file is backed up, and write zeros to the failing sector. Two things can happen:

  1. The drive successfully writes the zeroes to this sector in place. Current_Pending_Sector and Reallocated_Sector_Ct should both be zero afterwards.
  2. The drive fails to write to this sector and instead remaps it to a "spare" area (Reallocated_Sector_Ct increases).

Either way you end up with a readable sector again. You should then restore the affected file from backup (because you overwrote one sector of it), and rerun an extended self-test to make sure there are no more errors.
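
A sketch of the write-zeros step, assuming the LBA reported by the self-test refers to the bare drive (with an array, first map the LBA to the member disk as the BadBlockHowto describes); note that hdparm --write-sector destroys the contents of that one sector:

```
# Overwrite the failing sector with zeroes; the drive either rewrites it
# in place or remaps it to a spare sector
hdparm --yes-i-know-what-i-am-doing --write-sector 976642822 /dev/sde

# Check the counters, then rerun the extended self-test
smartctl -A /dev/sde | egrep 'Reallocated|Pending'
smartctl -t long /dev/sde
```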

Stay healthy!

P.S. I know this post is kind of old, but I googled my way here, and I think it is worth adding another answer.

Rve

The drive was likely on its way out. Being unable to read from part of the drive is most definitely a failure condition, and it is certainly possible for it to happen without other typical signs of disk failure. This type of thing isn't commonly transient; with no other signs it might be a weak head, a very slight alignment issue, or a defective area on a platter (cylinder?).

The other alternative is that there was a SMART bug; you really don't want to be running a drive with buggy firmware.

Anytime you see any error at all from SMART, it is a strong sign that you should get a new drive to avoid data loss. It's intended as an early warning system, in part.

Falcon Momot

  • Back up as soon as you can!

  • If this drive is still under warranty, then

    • run the vendor's check utility (you can usually get a boot CD)
    • if it reports an error then bingo: send the drive back and wait for a replacement
    • restore from backup
    • problem solved - END

  • If this drive has no warranty then you are screwed
    • there is still some hope...
    • as this is only a read error, it does not mean you cannot write to the drive
    • after making a backup you can try restoring that backup onto the drive, since it will overwrite the unreadable sectors with new data that you can actually read back (usually this works; in the background the drive will remap these blocks to spare sectors most of the time)
    • the badblocks tool can also be used for this (you already have backups, right?) - see the sketch after this list
      • here you do not use it to test the disk (that does not make much sense with newer disks anyway), but to write to those sectors multiple times
    • you can then re-run the SMART tests, and there is a chance that the unreadable sectors "correct themselves"
    • problem NOT solved; you have only made the drive last longer. It will probably still fail sooner than a healthy drive, maybe within a year depending on usage, but disks are cheap: get a new one if your data is important to you - END
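
A sketch of the badblocks approach (illustrative; the write-mode test erases the entire drive, so it is only an option once everything on it is backed up or disposable):

```
# Non-destructive read-write test: reads each block, writes a test
# pattern, verifies it, then restores the original contents
badblocks -nsv /dev/sde

# Destructive write-mode test: writes and verifies several patterns over
# the whole drive, forcing the firmware to remap genuinely bad sectors
# (WARNING: erases all data on the drive)
badblocks -wsv /dev/sde
```

Afterwards, `smartctl -l selftest` and the Reallocated/Pending attribute counters show whether the drive coped.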
cstamas
  • Modern hard drives (like since the turn of the century) don't work the way you've described in the "no warranty" section. – Chris S Jul 31 '11 at 21:54
  • Please provide a reference proving otherwise, then. As far as I know, even some vendor check and repair utilities do this. – cstamas Aug 03 '11 at 09:03
  • Start with [Wikipedia's Bad Sector](http://en.wikipedia.org/wiki/Bad_sector) article. Hard drives abstract the logical sector address and map it to sectors they believe are good. Some vendor utilities (and sometimes SMART, depending on what the drive exposes) can report on remapped sectors. Bad sectors are normally detected on write operations; usually once a sector has been written it can be read again, as it's the initial write that commonly fails on a bad sector. Once a sector is bad it's bad forever; there's no "correcting" it. – Chris S Aug 03 '11 at 12:51
  • I don't think I said anything that contradicts what you are saying, but I have clarified a bit to make it more "technically correct". – cstamas Aug 11 '11 at 15:06
  • Not sure why people down-voted your answer so much; I think you're spot on. People probably misread you as advocating keeping a flaky drive in operation. But considering the OP is a home user, the cost of a new drive can very well be a concern, even at today's prices. I know this is a pretty old question, but from me, at least, you get a +1. ;) – Markus A. Aug 11 '13 at 20:53
  • @cstamas: I can also agree that your answer is spot on: if a drive survives a full run of `badblocks -w` (3x writing, 3x reading) without creating new bad sectors, I'll keep it. Otherwise it's just too broken to be used anywhere. – kei1aeh5quahQu4U Oct 28 '13 at 20:02