When copying large files (50+ GB) from an NVMe disk to a 7200 rpm SATA HDD, I see the following error in the logs on a fully patched Ubuntu 20.04 system:

Aug 08 00:45:59 host kernel: ata6.00: exception Emask 0x20 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 08 00:45:59 host kernel: ata6.00: irq_stat 0x20000000, host bus error
Aug 08 00:45:59 host kernel: ata6.00: failed command: WRITE DMA EXT
Aug 08 00:45:59 host kernel: ata6.00: cmd 35/00:08:30:a2:e0/00:00:e8:00:00/e0 tag 23 dma 4096 out
                                    res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x20 (host bus error)
Aug 08 00:45:59 host kernel: ata6.00: status: { DRDY }
Aug 08 00:45:59 host kernel: ata6: hard resetting link
Aug 08 00:46:00 host kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 08 00:46:00 host kernel: ata6.00: configured for UDMA/133
Aug 08 00:46:00 host kernel: ata6: EH complete

ata6.00 is the disk which is being written to.
The issue is intermittent. Sometimes it does not appear for 24 hours; sometimes it appears a couple of times per hour. Often the disk recovers, but sometimes the filesystem becomes corrupt and needs to be unmounted, repaired (if possible), and remounted.
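To put numbers on "intermittent", the exceptions could be tallied per ATA port from the kernel log. The following is only a minimal sketch: the regular expression is derived from the log lines quoted above and nothing else, and the function name `count_ata_exceptions` is made up for illustration.

```python
import re
from collections import Counter

# Matches libata exception lines in the format quoted above, e.g.
#   "... kernel: ata6.00: exception Emask 0x20 SAct 0x0 SErr 0x0 action 0x6 frozen"
# The pattern is an assumption based solely on those sample lines.
EXCEPTION_RE = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: exception Emask (0x[0-9a-fA-F]+)"
)


def count_ata_exceptions(lines):
    """Return a Counter keyed by (port, emask), one count per exception line."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

It can be fed any iterable of log lines, for example an open file containing the saved output of `journalctl -k`. Comparing the counts between the affected server and the healthy one may show whether the errors are tied to one port or follow the workload.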

What I tried:

  1. I tried 3 different brands of HDD. All have the same issue.
  2. I suspected a hardware issue, so I replaced the motherboard and the SATA cables. Neither helped.
  3. I have another server with an identical configuration. The issue does not occur there. Same workload.
  4. I have yet another server with a completely different configuration (Intel vs. AMD). The issue occurs there. Same workload.
  5. I disabled NCQ via `echo 1 > /sys/block/sda/device/queue_depth`. It did not help.
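Since the NCQ setting in step 5 is applied per block device while the kernel reports errors per ATA port, it may be worth cross-checking which `sdX` sits on which `ataN` and what queue depth is actually in effect. This is a sketch under an assumption: on typical libata/AHCI systems the resolved sysfs path of a disk contains an `ataN` component. The function name is hypothetical.

```python
import os
import re


def ata_port_and_queue_depth(sys_block="/sys/block"):
    """Map each sdX device to (ATA port or None, queue_depth or None)."""
    result = {}
    for name in sorted(os.listdir(sys_block)):
        if not name.startswith("sd"):
            continue
        dev_path = os.path.join(sys_block, name)
        # /sys/block/sdX is a symlink; the resolved path is assumed to
        # contain the owning port, e.g. ".../ata6/host5/...".
        real_path = os.path.realpath(dev_path)
        match = re.search(r"/(ata\d+)/", real_path)
        depth = None
        try:
            with open(os.path.join(dev_path, "device", "queue_depth")) as f:
                depth = int(f.read().strip())
        except (OSError, ValueError):
            pass  # attribute missing or unreadable; leave depth as None
        result[name] = (match.group(1) if match else None, depth)
    return result
```

A queue depth of 1 on the device attached to the failing port would confirm that NCQ really was off there when the errors occurred. Note that the value written with `echo` does not persist across reboots.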

I ran out of ideas...
These are all data center grade components. Given the steps I've taken, I suppose it's not a hardware manufacturing defect.
Could this be software/OS/BIOS related?
Any ideas what else I should try?

mike
  • What are data center grade components? What is the HBA you are using? What is the motherboard? What is the RAM? – Michael Hampton Aug 10 '21 at 21:40
  • There is no HBA. The disks connect directly to SATA ports on the MB. The motherboard is Supermicro MBD-X11SPM-F-O. RAM is Samsung DDR4-3200, 8GB, ECC RDIMM, 1Rx8, 288pin. – mike Aug 11 '21 at 06:43
  • This still looks like a controller or cabling issue, but you might run `smartctl -a` on the disks to see if they have recorded errors. – Michael Hampton Aug 11 '21 at 13:14
  • It does show errors, but they're cryptic to me. Not sure where to go from there. https://gist.github.com/ceecko/c74c2aafc7d0b7fa1f9ad9a71e7d4717. I suspected controller or cabling issue but since both were replaced, I think the chances of both being bad are slim... – mike Aug 11 '21 at 17:54
  • You said you had multiple disks, but that gist shows the results for only one. Where are the rest of them? – Michael Hampton Aug 11 '21 at 17:57
  • I have just updated the gist with all the disks, including nvme disk which is used as a source for copy. – mike Aug 12 '21 at 06:46
  • Only _one_ of the three disks is showing these errors. You should try replacing this disk. – Michael Hampton Aug 12 '21 at 11:43
  • It does not seem to be the disk though. The `/dev/sdc` is connected via `ata6` and is used as a boot disk. This disk has failed even though there's nothing in the smart log. At that time, the disk with errors was mounted but not used. Do you think `/dev/sda` could have caused `/dev/sdc` to fail in such a way? As mentioned previously, these disks are the 3rd type of disks I tried. It would be a great coincidence to have 3rd batch of disks with the same issues I guess. – mike Aug 12 '21 at 17:56

2 Answers

Perhaps this is more a problem of operating temperature? Once the disk is in constant use, its physical position may mean it gains heat faster than it can shed it, leading to erratic behaviour.

On newer kernels like yours, the drive temperature can be exposed in sysfs under this path:

/sys/class/hwmon/*

Make sure the drivetemp module is loaded with `modprobe drivetemp`.

You could monitor the files there while running a large file copy again. The kernel hwmon documentation explains how these files are to be interpreted.

They include useful values such as the minimum and maximum operating temperatures. Some drivers also offer alarm indicators, which are chip-dependent alarms triggered on a fault.
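As a rough sketch of such monitoring, the snippet below looks for hwmon devices whose `name` file reads `drivetemp` and converts their readings from millidegrees Celsius to degrees. It assumes the `drivetemp` module is available (it was added in Linux 5.6) and that the drive reports `temp1_input`. The limit attributes `temp1_max` and `temp1_crit` are optional and may be absent. The function name is hypothetical.

```python
import glob
import os


def read_drivetemp(hwmon_root="/sys/class/hwmon"):
    """Return one dict per drivetemp hwmon device, temperatures in deg C."""
    readings = []
    for hwmon in sorted(glob.glob(os.path.join(hwmon_root, "hwmon*"))):
        try:
            with open(os.path.join(hwmon, "name")) as f:
                if f.read().strip() != "drivetemp":
                    continue  # some other sensor (CPU, NIC, ...)
        except OSError:
            continue
        entry = {"hwmon": os.path.basename(hwmon)}
        # hwmon temperature files hold millidegrees Celsius as integers.
        for attr in ("temp1_input", "temp1_max", "temp1_crit"):
            try:
                with open(os.path.join(hwmon, attr)) as f:
                    entry[attr] = int(f.read().strip()) / 1000.0
            except (OSError, ValueError):
                entry[attr] = None  # attribute not provided by this drive
        readings.append(entry)
    return readings
```

Calling this in a loop every few seconds during a large copy, and noting the temperature at the moment an ATA exception is logged, would show whether the errors line up with the drive approaching its reported limits.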

Matthew Ife

This seems to have been resolved by upgrading to Ubuntu 21.04, though I have no idea why. The server now runs stably without any ATA issues.

mike