SATA hard resetting link

Question

Here is my dmesg output:

ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
ata2.00: cmd 60/48:08:6f:13:3a/00:00:01:00:00/40 tag 1 ncq 36864 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: configured for UDMA/133
ata2: EH complete
SCSI device sdb: 490350672 512-byte hdwr sectors (251060 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back

What does it mean? Can someone say exactly what problem these error codes point to? Is "(timeout)" the main error in this output, or is it just one more error among the others?

Here is the SMART output:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   196   196   051    Pre-fail  Always       -       72539
  3 Spin_Up_Time            0x0027   200   200   021    Pre-fail  Always       -       991
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5010
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   118   100   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

I can see only one problem: Reallocated_Event_Count is 1. There is also only one error in dmesg. Is it possible that Reallocated_Event_Count and the error in dmesg are connected? The disk is only 5000 hours old... I have had the same issues before. It is a Western Digital RE2 250 GB disk.

I also have this problem. The system hangs for a few seconds when it happens. I have two hard drives, a WD Caviar Black and a Caviar Green. They are brand new, and according to SMART they are in perfect condition. When I get "hard resetting link" messages, Linux resets exactly one link, sometimes the Black's and sometimes the Green's. I have an ASUS M4N78 PRO motherboard. According to Asus it is Linux compatible, but I think the problem is in the motherboard or the chipset driver. Maybe it only happens if you use RAID. The problem is totally random; I can't trigger it in any way. – VargaD, Nov 30 '11 at 11:07

score 1 · Answer 1 · answered Dec 12 '10 at 00:03

No, the Reallocated_Event_Count should not cause the error in dmesg. The error in dmesg indicates that communication between the drive and the host chipset locked up and the link to the drive had to be reset. If this only happens once, I wouldn't consider it significant. If it occurs regularly, I would consider upgrading the firmware in the drive and checking whether the SATA cable in use is properly connected.
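
To judge whether the resets happen "once" or "regularly", one option is to count them per port in saved kernel log text. The following is only a minimal sketch: the function name `count_link_resets` is made up for illustration, and it assumes log lines shaped like the ones in the question (optionally prefixed with a `[seconds.micros]` timestamp).

```python
import re
from collections import Counter

# Matches lines such as "ata2: hard resetting link", with an optional
# "[   12.345678] " timestamp prefix as printed by some dmesg setups.
RESET_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?(ata\d+): hard resetting link",
    re.MULTILINE,
)


def count_link_resets(dmesg_text):
    """Return a Counter mapping each ATA port (e.g. 'ata2') to its reset count."""
    return Counter(RESET_RE.findall(dmesg_text))


if __name__ == "__main__":
    sample = (
        "ata2.00: status: { DRDY }\n"
        "ata2: hard resetting link\n"
        "ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)\n"
    )
    for port, count in count_link_resets(sample).items():
        print(port, count)
```

A single hit on one port matches the "happens once" case; counts that keep growing on the same port over days would be the "occurs regularly" case.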

David

I have had this server in production for 4 years now and have changed 5-6 disks since then. Last time the problem was the cable, as I suspected: I could not format the new drive until I changed the cable. For the first 2 years it had Seagate AS series drives installed (AS is the desktop series), and they worked without problems for those 2 years. After that there were errors (I don't remember exactly, but 99% the same errors), so I switched to the WD RE series (RE is the enterprise series) and have been changing disks every 6 months :D – user52475 Dec 12 '10 at 00:31

score 0 · Answer 2 · answered Dec 12 '10 at 05:10

When I have had errors like yours, they were normally fixed by replacing the drives (even though SMART did not report errors; it's not always 100% accurate and I prefer to be safe). However, since this is a recurrent problem, you should consider the possibility that it is the cables (already changed, so probably not) or the controller (try adding a PCI/PCIe controller and see if that helps). Upgrading the OS kernel might help too, if interrupts are getting lost because of buggy chipset support.
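
Since SMART may look clean even when something is wrong, it can still be worth checking a few raw counters explicitly. The sketch below is an illustration only: the function name `nonzero_watched` and the choice of attribute IDs (5, 196, 197, 198, 199) are assumptions, picked because those counters are commonly read as hints of media problems or, for 199, link CRC errors. It parses an attribute table in the layout pasted in the question.

```python
import re

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED,
# WHEN_FAILED, then the leading integer of RAW_VALUE.
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)",
    re.MULTILINE,
)

# Assumed watch list, not an official one.
WATCHED_IDS = {5, 196, 197, 198, 199}


def nonzero_watched(smart_text):
    """Return {id: (name, raw_value)} for watched attributes with raw > 0."""
    flagged = {}
    for attr_id, name, raw in ROW_RE.findall(smart_text):
        if int(attr_id) in WATCHED_IDS and int(raw) > 0:
            flagged[int(attr_id)] = (name, int(raw))
    return flagged


if __name__ == "__main__":
    sample = (
        "  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1\n"
        "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0\n"
    )
    print(nonzero_watched(sample))
```

A clean result here does not rule out a cable, power, or controller fault; it only shows that the drive itself has not counted anything in those attributes.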

totaam

I loaded the system with bonnie++ and got a lot of "frozen" and "timeout" errors. I changed the cable: same result, frozen and timeout. I connected the "bad" HDD to another server: no errors, although there was some hang during boot. I loaded that system with bonnie++: no errors. I connected the disk back to the original server: errors again. Then I switched the power connector to another PSU line, and no errors! :) So, is it possible the problem was a bad power connector or a bad contact? And if it is not a real disk problem but more of a cable/connection problem, will the errors be logged in SMART? – user52475 Dec 12 '10 at 22:57
Glad you got it working in the end. I really don't know whether SMART has any way of reporting these sorts of issues (undercurrent or link issues); my *guess* is probably not (and this could also vary between drives). – totaam Dec 14 '10 at 10:49
For the last two days I tried to get the errors again, with no luck :) No errors... I'm confused. Today I will take the disk to the seller and exchange it. – user52475 Dec 14 '10 at 12:25
