Faulty SATA disk but with a periodic error?

Question

I have a Seagate ST2000DM001 2TB Barracuda SATA3 disk that is producing errors similar to this:

[Tue Jun 14 10:02:06 2022] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[Tue Jun 14 10:02:06 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 10:02:06 2022] ata2.00: cmd 61/00:00:00:48:9f/02:00:b2:00:00/40 tag 0 ncq 262144 out
[Tue Jun 14 10:02:06 2022]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Tue Jun 14 10:02:06 2022] ata2.00: status: { DRDY }
[Tue Jun 14 10:02:06 2022] ata2: hard resetting link
[Tue Jun 14 10:02:16 2022] ata2: softreset failed (1st FIS failed)
[Tue Jun 14 10:02:16 2022] ata2: hard resetting link
[Tue Jun 14 10:02:26 2022] ata2: softreset failed (1st FIS failed)
[Tue Jun 14 10:02:26 2022] ata2: hard resetting link
[Tue Jun 14 10:02:42 2022] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[Tue Jun 14 10:02:42 2022] ata2.00: configured for UDMA/133
[Tue Jun 14 10:02:42 2022] ata2.00: device reported invalid CHS sector 0
[Tue Jun 14 10:02:42 2022] ata2: EH complete

I tested the disk with different cables and on different machines, and the errors persist. It looks like a clear-cut case of a broken disk, but there is a twist: grepping for the errors during a very long mkfs.ext4 -c -c run shows a periodic pattern:

[Mon Jun 13 10:47:02 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 11:51:08 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 12:55:14 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 14:01:21 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 15:08:27 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 16:15:33 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 17:22:39 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 18:29:43 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 19:36:49 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 20:43:55 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 21:50:02 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 22:57:08 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 00:04:14 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 01:11:17 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 02:15:24 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 03:19:30 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 04:26:36 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 05:33:42 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 06:40:48 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 07:47:54 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 08:55:00 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 10:02:06 2022] ata2.00: failed command: WRITE FPDMA QUEUED

It is almost exactly every 1 hour and 7 minutes. I thought it could be related to smartd, but smartd was not running. So I'm stuck: what kind of hardware fault would give a periodic error with a period of 1 hour and 7 minutes? Any ideas would be highly appreciated.

Best regards,

Nicholas

Possibly related: https://unix.stackexchange.com/questions/623238/root-causes-for-failed-command-write-fpdma-queued and https://serverfault.com/questions/952148/failed-command-write-fpdma-queued-cause-of-server-running-slow also of note/obligatory: back up your drive ASAP! — TCooper, Jun 14 '22 at 20:54
Just a quick comment to point out that Seagate Barracuda are the worst drives money can buy. WD Blue are not great, but Barracuda? I've set up 3000 of them, and had to return 1800 under RMA. That's how bad they are. — wazoox, Jun 16 '22 at 19:35

Marcus Müller · Accepted Answer · 2022-06-15T16:12:50.027


That's almost exactly 4000 seconds, within the accuracy of a cheap oscillator.
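The spacing can be checked directly from the timestamps posted in the question. Below is a minimal sketch in Python, assuming `dmesg -T` style timestamps and an English locale for month names; the function names `failure_times` and `intervals_seconds` are made up for illustration. On the 22 lines from the question, the gaps fall between 3846 s and 4026 s, with a mean of roughly 3986 s.

```python
import re
from datetime import datetime
from statistics import mean

# Matches "[Mon Jun 13 10:47:02 2022]" and captures everything after the weekday.
STAMP = re.compile(r"^\[\w{3} (\w{3} +\d+ \d\d:\d\d:\d\d \d{4})\]")

def failure_times(lines):
    """Return the timestamps of all 'failed command' lines (hypothetical helper)."""
    times = []
    for line in lines:
        m = STAMP.match(line)
        if m and "failed command" in line:
            # %b is locale dependent; this assumes English month abbreviations.
            times.append(datetime.strptime(m.group(1), "%b %d %H:%M:%S %Y"))
    return times

def intervals_seconds(times):
    """Return the gaps between consecutive timestamps, in whole seconds."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

# The grep output from the question, verbatim.
LOG = """\
[Mon Jun 13 10:47:02 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 11:51:08 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 12:55:14 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 14:01:21 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 15:08:27 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 16:15:33 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 17:22:39 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 18:29:43 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 19:36:49 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 20:43:55 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Mon Jun 13 21:50:02 2022] ata2.00: failed command: READ FPDMA QUEUED
[Mon Jun 13 22:57:08 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 00:04:14 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 01:11:17 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 02:15:24 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 03:19:30 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 04:26:36 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 05:33:42 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 06:40:48 2022] ata2.00: failed command: READ FPDMA QUEUED
[Tue Jun 14 07:47:54 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 08:55:00 2022] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Jun 14 10:02:06 2022] ata2.00: failed command: WRITE FPDMA QUEUED
""".splitlines()

gaps = intervals_seconds(failure_times(LOG))
print(gaps)
print(f"min={min(gaps)} max={max(gaps)} mean={mean(gaps):.1f}")
```

To run this on a live system instead, the same functions could be fed the output of `dmesg -T`, which prints timestamps in this format.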

This probably means that something in the SATA drive or SATA controller firmware does this automatically.

The reason for that could be basically anything. For example, the drive firmware might reset every 4000 s when some component-checking subroutine fails, or the SATA controller firmware might reset every 4000 s when it tries to renegotiate a link and that fails, or anything else, really (these two examples are no more likely than any other cause).

The only thing the timing suggests is that software is deciding to do this, whether that software is the operating system you run, the controller firmware, or the drive firmware. And that might be a software bug, or a real detection of a hardware error.

So this is really hard to diagnose. If the controller and drive are already at their most recent firmware revisions (fwupdmgr get-updates is your friend, for both), well.

edited Jun 15 '22 at 16:12

answered Jun 14 '22 at 08:55


Unless something changed, the command is `fwupdmgr --get-updates`. – Braiam Jun 15 '22 at 15:07
Or perhaps `fwupdmgr get-updates`. – UncleCarl Jun 15 '22 at 15:57
yes. Never type commands into a live system from the top of your head :) Fixed! – Marcus Müller Jun 15 '22 at 16:13
