First the long story:
I have a RAID5 with mdadm on Debian 9. The RAID has 5 disks of 4TB each. 4 of them are HGST Deskstar NAS drives, and one that was added later is a Toshiba N300 NAS.
In the past few days I noticed some read errors from that RAID. For example, I had a 10GB rar archive in multiple parts. When I try to extract it, I get CRC errors on some of the parts. If I try a second time, I get these errors on other parts. The same happens with torrents on a re-check after download.
After a reboot, my BIOS notified me that the S.M.A.R.T. status of an HGST drive on SATA port 3 is bad. smartctl told me that there are DMA CRC errors, but claimed that the drive is OK.
Another reboot later, I can't see the CRC errors in the SMART data anymore. But now I get this output:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-4-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 001 001 005 Pre-fail Always FAILING_NOW 1989
As the HGST drives aren't available at normal prices anymore, I bought another Toshiba N300 to replace the HGST. Both are labeled as 4TB. I tried to make a partition of the exact same size, but it didn't work. The partitioning program claimed that my number was too big (I tried it with bytes and sectors). So I just made the partition as big as possible. But now it looks like it is the same size, so I'm a bit confused.
sdc is the old one, and sdh is the new one:
Disk /dev/sdc: 3,7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 4CAD956D-E627-42D4-B6BB-53F48DF8AABC
Device Start End Sectors Size Type
/dev/sdc1 2048 7814028976 7814026929 3,7T Linux RAID
Disk /dev/sdh: 3,7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3A173902-47DE-4C96-8360-BE5DBED1EAD3
Device Start End Sectors Size Type
/dev/sdh1 2048 7814037134 7814035087 3,7T Linux filesystem
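To make the size comparison concrete, here is a minimal Python sketch that only redoes the arithmetic on the Start/End values printed above. It assumes that both values are inclusive sector numbers and that the 512-byte logical sector size shown for both disks applies; the function and variable names are made up for illustration.

```python
SECTOR_BYTES = 512  # logical sector size reported for both sdc and sdh


def partition_sectors(start, end):
    """Sector count of a partition, assuming Start and End are both inclusive."""
    return end - start + 1


# Start/End values copied from the partition tables above
old_sdc1 = partition_sectors(2048, 7814028976)
new_sdh1 = partition_sectors(2048, 7814037134)

extra_sectors = new_sdh1 - old_sdc1
extra_mib = extra_sectors * SECTOR_BYTES / 2**20

print("sdc1 sectors:", old_sdc1)
print("sdh1 sectors:", new_sdh1)
print("sdh1 is larger by", extra_sectors, "sectors, about", round(extra_mib, 2), "MiB")
```

Both results match the Sectors column, so the new partition appears to be slightly larger than the old one rather than identical. The 3,7T shown for both is just the rounded size.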
Currently I have added the new one as a spare disk. The RAID is still working with the old drive. I still have some read errors, especially on big files.
This is how my RAID currently looks:
/dev/md/0:
Version : 1.2
Creation Time : Sun Dec 17 22:03:20 2017
Raid Level : raid5
Array Size : 15627528192 (14903.57 GiB 16002.59 GB)
Used Dev Size : 3906882048 (3725.89 GiB 4000.65 GB)
Raid Devices : 5
Total Devices : 6
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Jan 5 09:48:49 2019
State : clean
Active Devices : 5
Working Devices : 6
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Name : SERVER:0 (local to host SERVER)
UUID : 16ee60d0:f055dedf:7bd40adc:f3415deb
Events : 25839
Number Major Minor RaidDevice State
0 8 49 0 active sync /dev/sdd1
1 8 33 1 active sync /dev/sdc1
3 8 1 2 active sync /dev/sda1
4 8 17 3 active sync /dev/sdb1
5 8 80 4 active sync /dev/sdf
6 8 113 - spare /dev/sdh1
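As a sanity check on these numbers, the following Python sketch compares the reported Used Dev Size with the size of the new partition. It assumes that mdadm prints Array Size and Used Dev Size in KiB (which agrees with the GiB/GB figures in parentheses) and takes the sdh1 sector count from the partition table. The md metadata and data offset are not shown in the output, so the headroom figure is only a rough indication, not proof that the spare fits.

```python
KIB = 1024
SECTOR_BYTES = 512

# Values copied from the mdadm output above (assumed to be in KiB)
used_dev_size_kib = 3906882048
array_size_kib = 15627528192
raid_devices = 5

# Sector count of sdh1 from the partition table
new_partition_sectors = 7814035087

# RAID5 spends one device's worth of space on parity
expected_array_kib = (raid_devices - 1) * used_dev_size_kib

# Space md uses per member, expressed in 512-byte sectors
used_dev_sectors = used_dev_size_kib * KIB // SECTOR_BYTES

# Sectors left on sdh1 for metadata and data offset (their size is not shown here)
headroom_sectors = new_partition_sectors - used_dev_sectors

print("array size consistent:", expected_array_kib == array_size_kib)
print("sectors used per member:", used_dev_sectors)
print("headroom on sdh1:", headroom_sectors, "sectors")
```

The per-member size that md actually uses is smaller than either partition, which may explain why the slightly different partition sizes do not show up in the array details.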
And this is the disk structure:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 3,7T 0 disk
└─sda1 8:1 0 3,7T 0 part
└─md0 9:0 0 14,6T 0 raid5
└─storageRaid 253:4 0 14,6T 0 crypt
└─vg_raid-raidVolume 253:5 0 14,6T 0 lvm /media/raidVolume
sdb 8:16 0 3,7T 0 disk
└─sdb1 8:17 0 3,7T 0 part
└─md0 9:0 0 14,6T 0 raid5
└─storageRaid 253:4 0 14,6T 0 crypt
└─vg_raid-raidVolume 253:5 0 14,6T 0 lvm /media/raidVolume
sdc 8:32 0 3,7T 0 disk
└─sdc1 8:33 0 3,7T 0 part
└─md0 9:0 0 14,6T 0 raid5
└─storageRaid 253:4 0 14,6T 0 crypt
└─vg_raid-raidVolume 253:5 0 14,6T 0 lvm /media/raidVolume
sdd 8:48 0 3,7T 0 disk
└─sdd1 8:49 0 3,7T 0 part
└─md0 9:0 0 14,6T 0 raid5
└─storageRaid 253:4 0 14,6T 0 crypt
└─vg_raid-raidVolume 253:5 0 14,6T 0 lvm /media/raidVolume
sdf 8:80 1 3,7T 0 disk
└─md0 9:0 0 14,6T 0 raid5
└─storageRaid 253:4 0 14,6T 0 crypt
└─vg_raid-raidVolume 253:5 0 14,6T 0 lvm /media/raidVolume
sdh 8:112 1 3,7T 0 disk
└─sdh1 8:113 1 3,7T 0 part
└─md0 9:0 0 14,6T 0 raid5
└─storageRaid 253:4 0 14,6T 0 crypt
└─vg_raid-raidVolume 253:5 0 14,6T 0 lvm /media/raidVolume
I'm a bit confused that the spare disk (sdh) is already in the crypt volume.
Questions:
Under what criteria will mdadm say that a disk has failed?
Can the random read errors come from one broken disk?
Doesn't the RAID detect it when a disk sends wrong data?
Is it dangerous to mark a disk manually as failed when the spare disk does not have the exact same size?