I have an Areca ARC-1883IX-12 RAID controller (firmware 1.54) in a machine running openSUSE 42.3 with the Xen hypervisor.
I run four cp processes in parallel to copy four big binary files within the local file system, for example:
cp /arecaDriveMnt/bigfile1.dat /arecaDriveMnt/bigfile1Copy1.dat
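The four parallel copies can be reproduced with a small shell function. This is a sketch, not the exact commands used: it assumes the four files are named bigfile1.dat to bigfile4.dat (following the pattern of the command above), and the function name copy_parallel is hypothetical.

```shell
# Hypothetical helper: start N cp processes in the background on one volume,
# then wait for each of them and report failure if any copy failed.
# Assumption: sources are named bigfileK.dat, copies bigfileKCopy1.dat (K = 1..N).
copy_parallel() {
  local mnt="$1" n="${2:-4}" pids="" i=1 pid="" status=0
  while [ "$i" -le "$n" ]; do
    cp -- "$mnt/bigfile$i.dat" "$mnt/bigfile${i}Copy1.dat" &
    pids="$pids $!"
    i=$((i + 1))
  done
  # Wait for every background cp individually so a failed copy is not lost.
  for pid in $pids; do
    wait "$pid" || status=1
  done
  return "$status"
}

# On the affected machine: copy_parallel /arecaDriveMnt
```

Raising the second argument (for example copy_parallel /arecaDriveMnt 8) spawns more concurrent copies, which matches the observation that the error appears once enough processes are running.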
When I generate this HDD load with several concurrent processes, errors appear in /var/log/messages. A few seconds after the first error occurs, the I/O throughput drops from ~500 MByte/s to basically zero, and I have to restart the machine to regain access to the RAID disks.
Edit: The error is independent of network traffic; it also happens if I spawn enough processes copying data on the local disk.
The errors in /var/log/messages:
2018-04-05T14:11:39.267042+02:00 dom0 kernel: [ 3324.524188] arcmsr14: abort device command of scsi id = 6 lun = 0
2018-04-05T14:11:42.499045+02:00 dom0 kernel: [ 3327.756238] arcmsr14: abort device command of scsi id = 6 lun = 0
2018-04-05T14:11:45.731043+02:00 dom0 kernel: [ 3330.988233] arcmsr14: abort device command of scsi id = 6 lun = 0
2018-04-05T14:11:48.963033+02:00 dom0 kernel: [ 3334.220268] arcmsr14: abort device command of scsi id = 6 lun = 0
2018-04-05T14:11:52.195037+02:00 dom0 kernel: [ 3337.452336] arcmsr14: abort device command of scsi id = 6 lun = 0
2018-04-05T14:11:55.427038+02:00 dom0 kernel: [ 3340.684381] arcmsr14: abort device command of scsi id = 6 lun = 0
2018-04-05T14:11:58.659044+02:00 dom0 kernel: [ 3343.916533] arcmsr14: abort device command of scsi id = 6 lun = 0
2018-04-05T14:12:01.891054+02:00 dom0 kernel: [ 3347.148512] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 7
2018-04-05T14:12:33.891069+02:00 dom0 kernel: [ 3379.148850] arcmsr14: wait 'abort all outstanding command' timeout
2018-04-05T14:12:33.891093+02:00 dom0 kernel: [ 3379.150370] arcmsr14: executing hw bus reset .....
2018-04-05T14:12:46.923049+02:00 dom0 kernel: [ 3392.181980] arcmsr14: wait 'get adapter firmware miscellaneous data' timeout
The value in /sys/block/sdh/device/timeout is 30 (seconds).
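To compare this value across all SCSI disks, a helper like the following can list the command timeout of every sd* device from sysfs. The function name show_scsi_timeouts is hypothetical; the optional argument only exists so the function can be pointed at a copy of the sysfs tree and defaults to /sys.

```shell
# Hypothetical helper: print "<device> <timeout in seconds>" for each sd* disk.
show_scsi_timeouts() {
  local root="${1:-/sys}" f="" dev=""
  for f in "$root"/block/sd*/device/timeout; do
    # Skip the literal pattern if the glob matched nothing.
    [ -r "$f" ] || continue
    dev="${f#"$root"/block/}"   # e.g. sdh/device/timeout
    dev="${dev%%/*}"            # e.g. sdh
    printf '%s %s\n' "$dev" "$(cat "$f")"
  done
}

# Usage: show_scsi_timeouts    (reads /sys/block/sd*/device/timeout)
```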
I have not changed any configuration of the OS, the BIOS, or the RAID controller: the problem has existed since the initial openSUSE installation, with optimized default BIOS settings and untouched Areca RAID settings.
I tried the following to fix the error:
- Updating the BIOS
- Distributing the IRQs of the Areca kernel module "arcmsr" and of "eth1" to different processors
- Disabling irqbalance.service
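The IRQ steps from the list can be sketched roughly as follows. This assumes the controller and NIC interrupts appear in /proc/interrupts under names containing "arcmsr" and "eth1"; the helper names irqs_for and pin_irqs and the CPU numbers in the usage lines are placeholders, not the settings actually used. irqbalance has to be stopped first, otherwise it may move the interrupts again.

```shell
# Hypothetical helper: print the IRQ numbers whose /proc/interrupts line mentions NAME.
irqs_for() {
  local name="$1" file="${2:-/proc/interrupts}"
  awk -v n="$name" 'index($0, n) { sub(/:$/, "", $1); if ($1 ~ /^[0-9]+$/) print $1 }' "$file"
}

# Hypothetical helper: write a CPU list (e.g. "2" or "4-5") to the
# smp_affinity_list of every IRQ matching NAME. Needs root.
pin_irqs() {
  local name="$1" cpus="$2" irq=""
  for irq in $(irqs_for "$name"); do
    echo "$cpus" > "/proc/irq/$irq/smp_affinity_list"
  done
}

# Usage (as root; CPU numbers are placeholders):
#   systemctl stop irqbalance.service
#   systemctl disable irqbalance.service
#   pin_irqs arcmsr 2
#   pin_irqs eth1 3
```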
Has anybody had a similar issue, and how did you fix it?