I have some servers running Debian 8 with 8x800GB SSDs configured as RAID6. All disks are connected to an LSI 3008 flashed to IT mode. Each server also has a 2-disk RAID1 pair for the OS.
Current state
# dpkg -l|grep mdad
ii mdadm 3.3.2-5+deb8u1 amd64 tool to administer Linux MD arrays (software RAID)
# uname -a
Linux R5U32-B 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
# more /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid6 sde1[1](F) sdg1[3] sdf1[2] sdd1[0] sdh1[7] sdb1[6] sdj1[5] sdi1[4]
4687678464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [U_UUUUUU]
bitmap: 3/6 pages [12KB], 65536KB chunk
md1 : active (auto-read-only) raid1 sda5[0] sdc5[1]
62467072 blocks super 1.2 [2/2] [UU]
resync=PENDING
md0 : active raid1 sda2[0] sdc2[1]
1890881536 blocks super 1.2 [2/2] [UU]
bitmap: 2/15 pages [8KB], 65536KB chunk
unused devices: <none>
# mdadm --detail /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Fri Jun 24 04:35:18 2016
Raid Level : raid6
Array Size : 4687678464 (4470.52 GiB 4800.18 GB)
Used Dev Size : 781279744 (745.09 GiB 800.03 GB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Jul 19 17:36:15 2016
State : active, degraded
Active Devices : 7
Working Devices : 7
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : R5U32-B:2 (local to host R5U32-B)
UUID : 24299038:57327536:4db96d98:d6e914e2
Events : 2514191
Number Major Minor RaidDevice State
0 8 49 0 active sync /dev/sdd1
1 0 0 1 removed
2 8 81 2 active sync /dev/sdf1
3 8 97 3 active sync /dev/sdg1
4 8 129 4 active sync /dev/sdi1
5 8 145 5 active sync /dev/sdj1
6 8 17 6 active sync /dev/sdb1
7 8 113 7 active sync /dev/sdh1
1 8 65 - faulty /dev/sde1
Problem
The RAID6 array degrades semi-regularly, every 1-3 days or so, because one of its disks (a different one each time) is marked as faulty with the following error (a re-add sketch follows the log):
# dmesg -T
[Sat Jul 16 05:38:45 2016] sd 0:0:3:0: attempting task abort! scmd(ffff8810350cbe00)
[Sat Jul 16 05:38:45 2016] sd 0:0:3:0: [sde] CDB:
[Sat Jul 16 05:38:45 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Jul 16 05:38:45 2016] scsi target0:0:3: handle(0x000d), sas_address(0x500304801707a443), phy(3)
[Sat Jul 16 05:38:45 2016] scsi target0:0:3: enclosure_logical_id(0x500304801707a47f), slot(3)
[Sat Jul 16 05:38:46 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff8810350cbe00)
[Sat Jul 16 05:38:46 2016] end_request: I/O error, dev sde, sector 2064
[Sat Jul 16 05:38:46 2016] md: super_written gets error=-5, uptodate=0
[Sat Jul 16 05:38:46 2016] md/raid:md2: Disk failure on sde1, disabling device.
[Sat Jul 16 05:38:46 2016] md/raid:md2: Operation continuing on 7 devices.
[Sat Jul 16 05:38:46 2016] RAID conf printout:
[Sat Jul 16 05:38:46 2016] --- level:6 rd:8 wd:7
[Sat Jul 16 05:38:46 2016] disk 0, o:1, dev:sdd1
[Sat Jul 16 05:38:46 2016] disk 1, o:0, dev:sde1
[Sat Jul 16 05:38:46 2016] disk 2, o:1, dev:sdf1
[Sat Jul 16 05:38:46 2016] disk 3, o:1, dev:sdg1
[Sat Jul 16 05:38:46 2016] disk 4, o:1, dev:sdi1
[Sat Jul 16 05:38:46 2016] disk 5, o:1, dev:sdj1
[Sat Jul 16 05:38:46 2016] disk 6, o:1, dev:sdb1
[Sat Jul 16 05:38:46 2016] disk 7, o:1, dev:sdh1
[Sat Jul 16 05:38:46 2016] RAID conf printout:
[Sat Jul 16 05:38:46 2016] --- level:6 rd:8 wd:7
[Sat Jul 16 05:38:46 2016] disk 0, o:1, dev:sdd1
[Sat Jul 16 05:38:46 2016] disk 2, o:1, dev:sdf1
[Sat Jul 16 05:38:46 2016] disk 3, o:1, dev:sdg1
[Sat Jul 16 05:38:46 2016] disk 4, o:1, dev:sdi1
[Sat Jul 16 05:38:46 2016] disk 5, o:1, dev:sdj1
[Sat Jul 16 05:38:46 2016] disk 6, o:1, dev:sdb1
[Sat Jul 16 05:38:46 2016] disk 7, o:1, dev:sdh1
[Sat Jul 16 12:40:00 2016] sd 0:0:7:0: attempting task abort! scmd(ffff88000d76eb00)
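For reference, when the kicked disk itself still checks out healthy, it can usually be returned to the array along these lines (a generic sketch, not my exact history; /dev/sde1 stands for whichever member was marked faulty):
# mdadm /dev/md2 --remove /dev/sde1
# mdadm /dev/md2 --re-add /dev/sde1
# cat /proc/mdstat
With the internal write-intent bitmap, a successful --re-add normally only resyncs the blocks written while the disk was out, rather than rebuilding the whole member.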
Already tried
I have already tried the following, with no improvement (the exact commands are sketched after this list):
- increase /sys/block/md2/md/stripe_cache_size from 256 to 16384
- increase dev.raid.speed_limit_min from 1000 to 50000
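For completeness, both knobs were set through their standard interfaces, roughly as follows (values as listed above; these settings do not survive a reboot unless persisted via sysctl.conf or a udev rule):
# echo 16384 > /sys/block/md2/md/stripe_cache_size
# sysctl -w dev.raid.speed_limit_min=50000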
Need your help
Are these errors caused by the mdadm configuration, the kernel, or the controller?
Update 20160802
Following the advice of ppetraki and others:
Use raw disks instead of partitions
This did not solve the issue.
Decrease the chunk size
The chunk size was changed to 128KB and then 64KB, but the RAID volume still degraded within a few days, and dmesg showed errors similar to the previous ones. I did not get around to trying a 32KB chunk size.
Reduce the array to 6 disks
I destroyed the existing RAID, zeroed the superblock on each disk, and created a new RAID6 from 6 raw disks with a 64KB chunk (roughly as sketched below). Reducing the number of disks seems to make the array live longer, around 4-7 days before it degrades.
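A rough sketch of that rebuild (illustrative; the device names match the later RAID conf printout, and the exact invocation may have differed):
# mdadm --stop /dev/md2
# mdadm --zero-superblock /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
# mdadm --create /dev/md2 --level=6 --raid-devices=6 --chunk=64 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
mdadm interprets --chunk=64 as 64KB.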
Update the driver
I updated the driver to Linux_Driver_RHEL6-7_SLES11-12_P12 (http://www.avagotech.com/products/server-storage/host-bus-adapters/sas-9300-8e). Disk errors still appear, as shown below (a quick driver-version sanity check is sketched after this log):
[Tue Aug 2 17:57:48 2016] sd 0:0:6:0: attempting task abort! scmd(ffff880fc0dd1980)
[Tue Aug 2 17:57:48 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug 2 17:57:48 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug 2 17:57:48 2016] scsi target0:0:6: handle(0x0010), sas_address(0x50030480173ee946), phy(6)
[Tue Aug 2 17:57:48 2016] scsi target0:0:6: enclosure_logical_id(0x50030480173ee97f), slot(6)
[Tue Aug 2 17:57:49 2016] sd 0:0:6:0: task abort: SUCCESS scmd(ffff880fc0dd1980)
[Tue Aug 2 17:57:49 2016] end_request: I/O error, dev sdg, sector 0
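As a sanity check that the updated mpt3sas module is the one actually loaded (illustrative commands):
# cat /sys/module/mpt3sas/version
# modinfo mpt3sas | grep -i ^version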
Just a few moments ago the array degraded again. This time /dev/sdf and /dev/sdg show the "attempting task abort! scmd" error:
[Tue Aug 2 21:26:02 2016]
[Tue Aug 2 21:26:02 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug 2 21:26:02 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug 2 21:26:02 2016] scsi target0:0:5: handle(0x000f), sas_address(0x50030480173ee945), phy(5)
[Tue Aug 2 21:26:02 2016] scsi target0:0:5: enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug 2 21:26:02 2016] scsi target0:0:5: enclosure level(0x0000), connector name( ^A)
[Tue Aug 2 21:26:03 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88103beb5240)
[Tue Aug 2 21:26:03 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88107934e080)
[Tue Aug 2 21:26:03 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug 2 21:26:03 2016] Read(10): 28 00 04 75 3b f8 00 00 08 00
[Tue Aug 2 21:26:03 2016] scsi target0:0:5: handle(0x000f), sas_address(0x50030480173ee945), phy(5)
[Tue Aug 2 21:26:03 2016] scsi target0:0:5: enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug 2 21:26:03 2016] scsi target0:0:5: enclosure level(0x0000), connector name( ^A)
[Tue Aug 2 21:26:03 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88107934e080)
[Tue Aug 2 21:26:04 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug 2 21:26:04 2016] Read(10): 28 00 04 75 3b f8 00 00 08 00
[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: sas_address(0x50030480173ee945), phy(5)
[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: enclosure level(0x0000), connector name( ^A)
[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: handle(0x000f), ioc_status(success)(0x0000), smid(35)
[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: request_len(4096), underflow(4096), resid(-4096)
[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: tag(65535), transfer_count(8192), sc->result(0x00000000)
[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[Tue Aug 2 22:14:51 2016] sd 0:0:6:0: attempting task abort! scmd(ffff880931d8c840)
[Tue Aug 2 22:14:51 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug 2 22:14:51 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug 2 22:14:51 2016] scsi target0:0:6: handle(0x0010), sas_address(0x50030480173ee946), phy(6)
[Tue Aug 2 22:14:51 2016] scsi target0:0:6: enclosure logical id(0x50030480173ee97f), slot(6)
[Tue Aug 2 22:14:51 2016] scsi target0:0:6: enclosure level(0x0000), connector name( ^A)
[Tue Aug 2 22:14:51 2016] sd 0:0:6:0: task abort: SUCCESS scmd(ffff880931d8c840)
[Tue Aug 2 22:14:52 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug 2 22:14:52 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: sas_address(0x50030480173ee946), phy(6)
[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: enclosure logical id(0x50030480173ee97f), slot(6)
[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: enclosure level(0x0000), connector name( ^A)
[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: handle(0x0010), ioc_status(success)(0x0000), smid(85)
[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: request_len(0), underflow(0), resid(-8192)
[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: tag(65535), transfer_count(8192), sc->result(0x00000000)
[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[Tue Aug 2 22:14:52 2016] end_request: I/O error, dev sdg, sector 16
[Tue Aug 2 22:14:52 2016] md: super_written gets error=-5, uptodate=0
[Tue Aug 2 22:14:52 2016] md/raid:md2: Disk failure on sdg, disabling device.
[Tue Aug 2 22:14:52 2016] md/raid:md2: Operation continuing on 5 devices.
[Tue Aug 2 22:14:52 2016] RAID conf printout:
[Tue Aug 2 22:14:52 2016] --- level:6 rd:6 wd:5
[Tue Aug 2 22:14:52 2016] disk 0, o:1, dev:sdc
[Tue Aug 2 22:14:52 2016] disk 1, o:1, dev:sdd
[Tue Aug 2 22:14:52 2016] disk 2, o:1, dev:sde
[Tue Aug 2 22:14:52 2016] disk 3, o:1, dev:sdf
[Tue Aug 2 22:14:52 2016] disk 4, o:0, dev:sdg
[Tue Aug 2 22:14:52 2016] disk 5, o:1, dev:sdh
[Tue Aug 2 22:14:52 2016] RAID conf printout:
[Tue Aug 2 22:14:52 2016] --- level:6 rd:6 wd:5
[Tue Aug 2 22:14:52 2016] disk 0, o:1, dev:sdc
[Tue Aug 2 22:14:52 2016] disk 1, o:1, dev:sdd
[Tue Aug 2 22:14:52 2016] disk 2, o:1, dev:sde
[Tue Aug 2 22:14:52 2016] disk 3, o:1, dev:sdf
[Tue Aug 2 22:14:52 2016] disk 5, o:1, dev:sdh
I assume the "attempting task abort! scmd" errors lead to the array degrading, but I don't know what causes them.
Update 20160806
I set up another server with the same specs, but without an mdadm RAID: each disk carries a single ext4 filesystem mounted directly (rough setup sketch below). After a while the kernel log again showed "attempting task abort! scmd" on some disks; /dev/sdd1 then hit an I/O error and its filesystem was remounted read-only.
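The per-disk setup was roughly the following (a sketch; device names and mount points are illustrative):
# mkfs.ext4 /dev/sdd1
# mkdir -p /mnt/disk3
# mount /dev/sdd1 /mnt/disk3
The kernel log from that machine: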
$ dmesg -T
[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug 6 05:21:09 2016] Read(10): 28 00 2d 29 21 00 00 00 20 00
[Sat Aug 6 05:21:09 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)
[Sat Aug 6 05:21:09 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)
[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff88006b206800)
[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: attempting task abort! scmd(ffff88019a3a07c0)
[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug 6 05:21:09 2016] Read(10): 28 00 08 46 8f 80 00 00 20 00
[Sat Aug 6 05:21:09 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)
[Sat Aug 6 05:21:09 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)
[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff88019a3a07c0)
[Sat Aug 6 05:21:10 2016] sd 0:0:3:0: attempting device reset! scmd(ffff880f9a49ac80)
[Sat Aug 6 05:21:10 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug 6 05:21:10 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Aug 6 05:21:10 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)
[Sat Aug 6 05:21:10 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)
[Sat Aug 6 05:21:10 2016] sd 0:0:3:0: device reset: SUCCESS scmd(ffff880f9a49ac80)
[Sat Aug 6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[Sat Aug 6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[Sat Aug 6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[Sat Aug 6 05:21:11 2016] end_request: I/O error, dev sdd, sector 780443696
[Sat Aug 6 05:21:11 2016] Aborting journal on device sdd1-8.
[Sat Aug 6 05:21:11 2016] EXT4-fs error (device sdd1): ext4_journal_check_start:56: Detected aborted journal
[Sat Aug 6 05:21:11 2016] EXT4-fs (sdd1): Remounting filesystem read-only
[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88024fc08340)
[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: [sdf] CDB:
[Sat Aug 6 05:40:35 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Aug 6 05:40:35 2016] scsi target0:0:5: handle(0x000c), sas_address(0x4433221105000000), phy(5)
[Sat Aug 6 05:40:35 2016] scsi target0:0:5: enclosure_logical_id(0x500304801a5d3f01), slot(5)
[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: task abort: FAILED scmd(ffff88024fc08340)
[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88019a12ee00)
[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: [sdf] CDB:
[Sat Aug 6 05:40:35 2016] Read(10): 28 00 27 c8 b4 e0 00 00 20 00
[Sat Aug 6 05:40:35 2016] scsi target0:0:5: handle(0x000c), sas_address(0x4433221105000000), phy(5)
[Sat Aug 6 05:40:35 2016] scsi target0:0:5: enclosure_logical_id(0x500304801a5d3f01), slot(5)
[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88019a12ee00)
[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88203eaddac0)
Update 20160930
After the controller firmware was upgraded to the latest version (currently 12.00.02), the issue disappeared.
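For anyone hitting the same symptoms: on these SAS 3008 HBAs the firmware and BIOS versions can be listed with Broadcom/Avago's sas3flash utility before and after the upgrade (illustrative; the flashing procedure itself depends on the vendor package):
# sas3flash -list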
Conclusion
Upgrading the controller firmware solved the issue.