Syncing my PostgreSQL master to the slave server resulted in write I/O errors on the slave (from journalctl):
Aug 18 03:09:23 db01a kernel: EXT4-fs warning (device dm-3):
**ext4_end_bio:330: I/O error -5 writing to inode 86772956 (offset 905969664 size 8388608 starting block 368694016)**
Aug 18 03:09:23 db01a kernel: buffer_io_error: 326 callbacks suppressed
....
Reading the affected file of course also doesn't work:
cat base/96628250/96737718 >> /dev/null
cat: 96737718: Input/output error
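To correlate the error with on-disk locations, I could list the file's extents and look at the inode from the kernel message (a sketch; filefrag is run from the PostgreSQL data directory, and the inode number is the one from the error above):
filefrag -v base/96628250/96737718                           # logical-to-physical extent map of the file
debugfs -R "stat <86772956>" /dev/mapper/vgdb01a-lvpostgres  # details of inode 86772956 from the kernel error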
Shouldn't the Linux kernel (Ubuntu 16.04, 4.4.0-87-generic) kick the affected drive out of the array automatically?
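For reference, whether md has actually marked a member as faulty can be checked with:
cat /proc/mdstat         # a failed member shows up as (F) and the status line drops to e.g. [6/5]
mdadm --detail /dev/md1  # per-device state (active/faulty) and the "Failed Devices" counter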
As it is a RAID6 (with LVM and ext4 on top), I already tried to provoke the error by overwriting every SSD a few times with badblocks (removing one disk after another from the array for that), unfortunately without success.
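Per disk, that test was roughly the following (a sketch; sdX is just a placeholder, the partition layout matches the lsblk output further down):
mdadm /dev/md0 --fail /dev/sdX1 --remove /dev/sdX1
mdadm /dev/md1 --fail /dev/sdX5 --remove /dev/sdX5
badblocks -wsv /dev/sdX   # destructive four-pattern write/read test over the whole SSD
# afterwards: restore the partition table, re-add sdX1/sdX5 and wait for the resync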
smartctl says one disk had errors before (the others are clean):
smartctl -a /dev/sda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 2
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 2
183 Runtime_Bad_Block 0x0013 099 099 010 Pre-fail Always - 2
187 Uncorrectable_Error_Cnt 0x0032 099 099 000 Old_age Always - 3
195 ECC_Error_Rate 0x001a 199 199 000 Old_age Always - 3
But rewriting the whole disk with badblocks -wsv worked without error.
As it is a pretty important server for me, I replaced the whole server with a different model, but the error persisted. Am I correct in thinking that it's probably a disk issue?
Is there any way to determine which disk is affected, perhaps by calculating it from the reported block numbers?
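My rough idea for that calculation (a sketch; every number except the block from the kernel error is an assumption and has to be replaced with the real LVM/md values shown further down):
FSBLOCK=368694016      # "starting block" from the kernel message, assuming 4 KiB ext4 blocks
PE_START=2048          # PV data offset on /dev/md1 in 512-byte sectors (pvs -o +pe_start)
PE_SIZE_SECT=8192      # physical extent size in sectors, assuming 4 MiB extents (vgdisplay)
LV_START_PE=0          # first physical extent of lvpostgres on the PV (lvdisplay --maps)
MD_SECTOR=$(( FSBLOCK * 8 + LV_START_PE * PE_SIZE_SECT + PE_START ))
echo $MD_SECTOR        # sector offset of the failing block inside /dev/md1
# Dividing that by the chunk size in sectors (from mdadm --detail /dev/md1) gives the data
# chunk index; which member holds it then follows from the RAID6 layout rotation
# (left-symmetric by default) - or let md find it: run a check and watch which member
# logs read errors.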
EDIT: For clarification: what I don't understand is how the initial sync of 1.5 TB of data from the master to the slave can result in two unrecoverable I/O errors, while destructive read-write tests run manually on every involved SSD complete without any errors.
EDIT2: Output of lsblk (identical for sda-sdf), pvs, vgs and lvs:
lsblk:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:16 0 953.9G 0 disk
├─sda1 8:17 0 4.7G 0 part
│ └─md0 9:0 0 4.7G 0 raid1
└─sda5 8:21 0 949.2G 0 part
└─md1 9:1 0 2.8T 0 raid6
├─vgdb01a-lvroot 252:0 0 18.6G 0 lvm /
├─vgdb01a-lvvar 252:1 0 28G 0 lvm /var
├─vgdb01a-lvtmp 252:2 0 4.7G 0 lvm /tmp
└─vgdb01a-lvpostgres 252:3 0 2.6T 0 lvm /postgres
pvs:
PV VG Fmt Attr PSize PFree
/dev/md1 vgdb01a lvm2 a-- 2.78t 133.64g
vgs:
VG #PV #LV #SN Attr VSize VFree
vgdb01a 1 4 0 wz--n- 2.78t 133.64g
lvs:
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lvpostgres vgdb01a -wi-ao---- 2.60t
lvroot vgdb01a -wi-ao---- 18.62g
lvtmp vgdb01a -wi-ao---- 4.66g
lvvar vgdb01a -wi-ao---- 27.94g
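The LVM values needed for the calculation sketched above (PV data start, extent size, and where lvpostgres starts on the PV) would come from:
pvs -o +pe_start /dev/md1                   # offset of the PV data area on the md device
vgdisplay vgdb01a | grep "PE Size"          # physical extent size
lvdisplay --maps /dev/vgdb01a/lvpostgres    # which physical extent ranges back this LV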
Update 2017-08-22
echo check > /sys/block/md1/md/sync_action
[Mon Aug 21 16:10:22 2017] md: data-check of RAID array md1
[Mon Aug 21 16:10:22 2017] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[Mon Aug 21 16:10:22 2017] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[Mon Aug 21 16:10:22 2017] md: using 128k window, over a total of 995189760k.
[Mon Aug 21 18:58:18 2017] md: md1: data-check done.
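Whether the check actually found inconsistent stripes can be read from the mismatch counter afterwards:
cat /sys/block/md1/md/mismatch_cnt   # 0 means no parity mismatches were found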
echo repair > /sys/block/md1/md/sync_action
[Tue Aug 22 12:54:11 2017] md: requested-resync of RAID array md1
[Tue Aug 22 12:54:11 2017] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[Tue Aug 22 12:54:11 2017] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
[Tue Aug 22 12:54:11 2017] md: using 128k window, over a total of 995189760k.
[2160302.241701] md: md1: requested-resync done.
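To verify, I'll also re-read the file that originally failed (run from the PostgreSQL data directory, same file as above):
cat base/96628250/96737718 > /dev/null && echo read ok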
e2fsck -y -f /dev/mapper/vgdb01a-lvpostgres
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/vgdb01a-lvpostgres: 693517/174489600 files (1.6% non-contiguous), 608333768/697932800 blocks
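To double-check which file the inode from the original kernel error belongs to, debugfs can resolve it back to a path:
debugfs -R "ncheck 86772956" /dev/mapper/vgdb01a-lvpostgres   # prints the pathname(s) for inode 86772956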
Update 2017-08-22 (2): Output of lsscsi and smartctl for all disks on pastebin: https://pastebin.com/VUxKEKiF