
Syncing my postgres master to the slave server resulted in write I/O errors on the slave (journalctl):

Aug 18 03:09:23 db01a kernel: EXT4-fs warning (device dm-3): ext4_end_bio:330: I/O error -5 writing to inode 86772956 (offset 905969664 size 8388608 starting block 368694016)
Aug 18 03:09:23 db01a kernel: buffer_io_error: 326 callbacks suppressed

....

Reading the affected file of course also doesn't work:

cat base/96628250/96737718  >> /dev/null
cat: 96737718: Input/output error

Shouldn't the Linux kernel (Ubuntu 16.04, 4.4.0-87-generic) kick the affected drive out of the array automatically?

As it is a RAID 6 (with LVM and ext4 on top), I already tried to provoke the error by overwriting every SSD a few times with badblocks, removing one disk after another from the array for that, unfortunately with no success.
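The per-disk test was roughly the following (a sketch; sda is just the example member, and sfdisk is only one way to restore the partition layout that the destructive badblocks run wipes):

    mdadm /dev/md1 --fail /dev/sda5 --remove /dev/sda5   # kick the member out of the RAID 6
    badblocks -wsv /dev/sda                              # destructive write/read test (0xaa, 0x55, 0xff, 0x00)
    sfdisk -d /dev/sdb | sfdisk /dev/sda                 # restore the partition layout from a healthy member
    mdadm /dev/md1 --add /dev/sda5                       # re-add and let md resync before the next disk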

smartctl says one disk had errors before (the others are clean):

 smartctl -a /dev/sda
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       2
 179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always       -       2
 183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       2
 187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       3
 195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       3

But rewriting the whole disk with badblocks -wsv worked without error.

As it is a pretty important server for me, I replaced the whole server with a different model, but the error persisted. Am I correct in thinking that it's probably a disk issue?

Is there any way to find out which disk is affected, maybe by calculating the position of the failing block in the array?
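For reference, the mapping from the failing ext4 block down to a member disk would look roughly like this (a sketch only; the LVM offsets have to be read from the real metadata, and the last step depends on the RAID 6 parity rotation):

    # numbers from the ext4 warning above; 512 KiB chunks and 5 devices
    # (3 data chunks per stripe, algorithm 2) as reported by /proc/mdstat
    BLK=368694016                  # failing ext4 block (4 KiB blocks)
    LV_SECTOR=$((BLK * 8))         # same offset in 512-byte sectors inside the LV

    lvs -o lv_name,seg_start_pe,seg_pe_ranges vgdb01a    # where the LV sits in the PV
    pvs -o pv_name,pe_start /dev/md1                     # where the PV data area starts on md1

    # md_sector  ≈ pe_start + first_PE_of_segment * PE_size + LV_SECTOR   (single-segment LV)
    # data_chunk = md_sector / 1024        # 512 KiB chunk = 1024 sectors
    # stripe     = data_chunk / 3          # 3 data chunks per stripe on a 5-disk RAID 6
    # which of sda..sdf holds that chunk then depends on the parity rotation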

EDIT: For clarification: What I'm not getting is how the initial sync of 1.5 TB of data from the master to the slave can result in two unrecoverable I/O errors, while manually running destructive read-write tests on every involved SSD completes without any error.

EDIT2: Output of lsblk (identical for sda-sdf), pvs, vgs and lvs:

lsblk:
sda                         8:16   0 953.9G  0 disk                                                
├─sda1                     8:17   0   4.7G  0 part                                                
│ └─md0                    9:0    0   4.7G  0 raid1                                               
└─sda5                     8:21   0 949.2G  0 part                                                
  └─md1                    9:1    0   2.8T  0 raid6                                               
    ├─vgdb01a-lvroot     252:0    0  18.6G  0 lvm   /                                             
    ├─vgdb01a-lvvar      252:1    0    28G  0 lvm   /var                                          
    ├─vgdb01a-lvtmp      252:2    0   4.7G  0 lvm   /tmp                                          
    └─vgdb01a-lvpostgres 252:3    0   2.6T  0 lvm   /postgres 

pvs: 
PV         VG      Fmt  Attr PSize PFree  
/dev/md1   vgdb01a lvm2 a--  2.78t 133.64g

vgs:
VG      #PV #LV #SN Attr   VSize VFree  
vgdb01a   1   4   0 wz--n- 2.78t 133.64g

lvs:
LV         VG      Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
lvpostgres vgdb01a -wi-ao----  2.60t                                                    
lvroot     vgdb01a -wi-ao---- 18.62g                                                    
lvtmp      vgdb01a -wi-ao----  4.66g                                                    
lvvar      vgdb01a -wi-ao---- 27.94g    

Update 2017-8-22

echo check > /sys/block/md1/md/sync_action
[Mon Aug 21 16:10:22 2017] md: data-check of RAID array md1
[Mon Aug 21 16:10:22 2017] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[Mon Aug 21 16:10:22 2017] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[Mon Aug 21 16:10:22 2017] md: using 128k window, over a total of 995189760k.
[Mon Aug 21 18:58:18 2017] md: md1: data-check done.

echo repair > /sys/block/md1/md/sync_action
[Tue Aug 22 12:54:11 2017] md: requested-resync of RAID array md1
[Tue Aug 22 12:54:11 2017] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[Tue Aug 22 12:54:11 2017] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
[Tue Aug 22 12:54:11 2017] md: using 128k window, over a total of 995189760k.
[2160302.241701] md: md1: requested-resync done.
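The mismatch count after these runs (also confirmed in the comments below):

    cat /sys/block/md1/md/mismatch_cnt
    0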

e2fsck -y -f /dev/mapper/vgdb01a-lvpostgres
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/vgdb01a-lvpostgres: 693517/174489600 files (1.6% non-contiguous), 608333768/697932800 blocks

Update 2017-8-22 (2): Output of lsscsi and smartctl for all disks on pastebin: https://pastebin.com/VUxKEKiF

Toni
  • Please post the output of `dmesg | grep -i ata` – shodanshok Aug 18 '17 at 16:01
  • I think you destroyed your software RAID 6 by rewriting the whole disk with badblocks. You should only rewrite the bad sectors to initiate sector reallocation. – Mikhail Khirgiy Aug 18 '17 at 16:20
  • What does the mdadm status show via `cat /proc/mdstat`? – Mikhail Khirgiy Aug 18 '17 at 19:20
  • sudo cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10] md1 : active raid6 sdf5[6] sdd5[10] sdb5[5] sda5[8] sde5[7] 2985569280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU] bitmap: 4/8 pages [16KB], 65536KB chunk md0 : active raid1 sdd1[8] sdb1[5] sde1[6] sdf1[4] 4877312 blocks super 1.2 [4/4] [UUUU] – Toni Aug 21 '17 at 07:32
  • @shodanshok no recent ata messages from kernel... – Toni Aug 21 '17 at 07:44
  • Please show the output of `lsblk; pvs; vgs; lvs` – shodanshok Aug 21 '17 at 12:31
  • @shodanshok edited initial question with the outputs – Toni Aug 21 '17 at 12:55
  • Please post the output of `mdadm --examine-badblocks /dev/sd[abcdef]`. Also, try to read from the md array by issuing `dd if=/dev/md1 of=/dev/null bs=1M iflag=direct`. Can you read without errors? If not, please show the output of `dmesg`. – shodanshok Aug 21 '17 at 14:26
  • @shodanshok mdadm: mbr metadata does not support badblocks; dd gives an I/O error, probably the same as when manually catting the corrupt file: sudo cat 96737718_backup_170818_ts >> /dev/null cat: 96737718_backup_170818_ts: Input/output error (I recovered the original file from the master server but kept the I/O error around). Nothing in the kernel log, unfortunately. – Toni Aug 22 '17 at 15:24
  • So, a direct `dd` of the array generates a read error *without* any `dmesg` info? Can you show your `dmesg` output (using something like [pastebin](https://pastebin.com/))? – shodanshok Aug 22 '17 at 15:35
  • You have to look at the mismatch_count after the md repair to see what it did and then **run it again** and see if the mismatch_count goes to zero; if it doesn't go to zero then you're cooked. This is all a moot point. The smartctl report has told you that the drive is EOL and you've already experienced data corruption. No amount of spit and polish is going to fix this. The question is: do you want to deal with the array while you can plan the downtime or... when your cell phone rings? Lose someone's data and they'll never forgive you. Good luck. – ppetraki Aug 22 '17 at 15:44
  • Please add `dmesg` output on pastebin. Also, can you read, without errors, from sda issuing `dd if=/dev/sda of=/dev/null bs=1M iflag=direct` ? – shodanshok Aug 22 '17 at 16:00
  • @ppetraki the server is "only" a live replica in case the master fails. Full offsite backups are done twice a day on the master server. Nevertheless I'd like to fix the situation as soon as possible. I'm just not sure if I can trust the smart attributes and I also don't want to have to replace ALL disks without being able to replace them under warranty (a few bad blocks don't seem to be enough for warranty). – Toni Aug 22 '17 at 16:14
  • @shodanshok sudo cat /sys/block/md1/md/mismatch_cnt 0; dmesg/journalctl aren't showing any errors at all. I think that at the moment the array is clean. Should I delete the file with the I/O error and see if reading the complete array succeeds? – Toni Aug 22 '17 at 16:19
  • @toni SMART is the last word on disk health. If you don't think you can trust it then I don't know what to tell you. Interpreting it correctly is problematic, but from what you've shown, there is a strong correlation between running out of free blocks to remap and the recurrence of media errors that, despite your best efforts, continue to exist. You should check the primary too. If you really want to get warranty work on this, pull the drive and start writing random data over and over again until you get nothing but hard media errors. But you'll leave your array vulnerable in the process. – ppetraki Aug 22 '17 at 16:25
  • @ppetraki it's only what I read elsewhere: https://askubuntu.com/questions/325283/how-do-i-check-the-health-of-a-ssd/460463#460463 I'm sorry if this is a stupid question, but where exactly do you see the free blocks? Used_Rsvd_Blk_Cnt_Tot? Btw, two of the disks are new (sde, sdf), see Power_On_Hours. The primary does monthly md checks, but can't hurt to schedule a check for tonight, thanks! – Toni Aug 22 '17 at 16:35
  • @Toni: as stated above, I suggest you to run `dd if=/dev/sda of=/dev/null bs=1M iflag=direct` and paste the entire `dmesg` output. – shodanshok Aug 22 '17 at 18:02
  • @ppetraki I think nobody is questioning SMART's reports. However, the key question here is why a URE on a single disk means a RAID6 array can not be read cleanly. Basically, it seems that a single problem caused a double-parity RAID to become inconsistent. I would like to understand *what* is at play here... – shodanshok Aug 22 '17 at 18:05
  • @shodanshok RAID 5/6 stripe write. It's that simple. With the added bonus that it has to recompute parity and rewrite it *twice* if anything in that stripe changes. It's write amplification city. MD doesn't even have a read cache at the block level to my knowledge. You're getting that for free at the filesystem level. So if the filesystem cache evicted the page you're looking for, MD has to read the entire stripe, and there's your read error. Now RAIDs can play games where they only read/write part of the stripe but there's no getting around parity update on write. – ppetraki Aug 22 '17 at 23:54
  • @toni wow, wikipedia helps. https://en.wikipedia.org/wiki/S.M.A.R.T. , reallocated sectors count is your problem. and.. you're at pre-fail across the board. Which makes sense, the RAID has been spreading the writes around + parity. This is what happens when you RAID5/6 SSDs, they simply run out of write capacity. If you're going to do this they need to be commercial grade disks and even then it's a problem. RAID 10 is a better choice. It's almost a foreign concept to me because the last storage array I worked on nearly eliminated write amp :) – ppetraki Aug 23 '17 at 00:16
  • @ppetraki no, double-parity arrays should survive the loss of two disks or a URE *at the same location* on two disks. A reallocated/unreadable sector on a single disk should not cause the problem the OP reports. – shodanshok Aug 23 '17 at 07:27
  • @shodanshok No, you're right. I get it now. So... when does a RAID use its parity? When it detects an error. This RAID isn't detecting any errors, the filesystem is. If it were an error on read, the RAID should have been able to detect it, use parity to reconstruct that chunk, send the new chunk up and write back the problem chunk. But that's not happening here. How do we get around this? There's a specific range that's a problem area, and as long as the drives stay in their current configuration, that stripe range will continue to throw errors. Answer continued in next comment. – ppetraki Aug 23 '17 at 10:36
  • @shodanshok We rotate the disks in the raid. Pull the first disk and blank it, and then swap the 1st and 2nd disk and rebuild. Continue doing this until the first disk becomes the last disk. At which point we should have rebuilt every chunk of the array from parity. Theory is we can read from this chunk, but we don't detect the write failure, so the parity is consistent on this stripe and never used to update the bad seg. This method should rebuild every stripe from parity in new LBAs/disks so we should find the bad segment on write. I guess this is why people have switched to ZFS. – ppetraki Aug 23 '17 at 10:56
  • @shodanshok dd of sd[a-f] completes w/o errors: sudo dd if=/dev/sda of=/dev/null bs=1M iflag=direct 976762+1 records in 976762+1 records out 1024209543168 bytes (1.0 TB, 954 GiB) copied, 5705.13 s, 180 MB/s – Toni Aug 23 '17 at 14:00
  • @Toni I suggest you explain your case on the linux-raid mailing list. – shodanshok Aug 23 '17 at 14:23
  • @Toni any updates to report? – ppetraki Aug 29 '17 at 11:28
  • @ppetraki New errors appeared (as expected): EXT4-fs warning (device dm-3): ext4_end_bio:330: I/O error -5 writing to inode 127669402 (offset 0 size 0 starting block 368691216) Buffer I/O error on device dm-3, logical block 368691216 EXT4-fs warning (device dm-3): ext4_end_bio:330: I/O error -5 writing to inode 127669402 (offset 0 size 0 starting block 368691216) Buffer I/O error on device dm-3, logical block 368691216 – Toni Aug 30 '17 at 09:20
  • postgres: 2017-08-30 11:08:30 CEST [4895-12] ERROR: could not fsync file "base/365109265/414412942": Input/output error ---- Still no disk kicked by md though :( Going to order an enterprise ssd to replace the one with the reallocated sectors. – Toni Aug 30 '17 at 09:21
  • Haven't managed to post it to linux-raid mailing list yet, but good idea! – Toni Aug 30 '17 at 09:29
  • @Toni I would suggest going with a RAID 10 next time if random db writes is your primary workload. Also check out, https://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/ and https://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/ – ppetraki Aug 30 '17 at 15:36

1 Answer

UPDATE-8/22

If you want to solve this problem quickly, just replace the two drives that have the worst smartctl stats and reassess. Once you're out of reserved blocks your drive is EOL. Seeing that we buy these all at once, they tend to fail around the same time, so it doesn't matter which one the bad block is pinned to. Once you replace the worst two offenders (that means replace one, resync, and repeat), you'll have increased the overall health of the array, probably replaced the complaining disk, and dramatically reduced the risk of a double fault where you lose everything.
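Something along these lines per drive, waiting for the rebuild to finish before touching the next one (a sketch; sdX stands for the drive being replaced and sda for a healthy member):

    mdadm /dev/md1 --fail /dev/sdX5 --remove /dev/sdX5   # retire the worst offender
    # physically swap the SSD, then copy the partition layout from a healthy member
    sfdisk -d /dev/sda | sfdisk /dev/sdX
    mdadm /dev/md1 --add /dev/sdX5                       # rebuild starts
    watch cat /proc/mdstat                               # wait for [UUUUU] before doing the next one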

At the end of the day, your data is worth more than a few hundred bucks.

ENDUPDATE-8/22

UPDATE-8/21

Toni, yes, your original post has room for improvement. Given the facts available, this is the conclusion I arrived at. It also wasn't clear until now that you had already suffered data corruption.

It would be helpful if you included the headers with the smartctl output. This is easier on SCSI: sg_reassign will tell you how many free blocks you have left to reassign, and once that goes to zero, you're done. Seeing that smartctl is reporting "Pre-fail" in several categories, it sounds like you'll be done soon too.

Soon you'll experience hard media errors and MD will kick the drive. An fsck would be a good idea in the meanwhile. When a drive fails a write, it reassigns the destination from the free block pool; when that pool runs out, you get an unrecoverable media error.

Also enable MD's "disk scrubber" and run it from cron weekly; it will read every sector, repair anything it cannot read cleanly, and head this off before it becomes a real problem. See Documentation/md.txt in the kernel source.

[disk scrubber example] https://www.ogre.com/node/384
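A minimal weekly entry could look like this (a sketch; Debian/Ubuntu's mdadm package already ships a similar monthly checkarray job in /etc/cron.d/mdadm):

    # /etc/cron.d/md-scrub -- start an md data-check every Sunday at 01:00
    0 1 * * 0   root   echo check > /sys/block/md1/md/sync_action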

You still have to run smartmon on all the drives (once a day, off hours), parse the output, and create alarms to head off this very problem.
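A rough sketch of such a daily check (attribute names as in the question's smartctl output; smartd from smartmontools can do the same job via /etc/smartd.conf):

    #!/bin/bash
    # Daily SMART sweep: mail root when reallocation/uncorrectable counters are non-zero.
    warn=$(for d in /dev/sd[a-f]; do
        smartctl -A "$d" | awk -v dev="$d" \
            '/Reallocated_Sector_Ct|Uncorrectable_Error_Cnt|Runtime_Bad_Block/ && $NF+0 > 0 \
                 {print dev ": " $2 " = " $NF}'
    done)
    [ -n "$warn" ] && echo "$warn" | mail -s "SMART warnings on $(hostname)" root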

Folks, this is what hardware RAIDs do for you. The irony is, we have all the tools to provide a better MD experience, but no one puts them together into an integrated solution.

You're pretty much at the tail end of silent data corruption. An fsck might help you, but really the best thing to do is to refer to your backups (you kept backups, right? RAIDs are not backups) and prepare for this RAID to start sinking.

Then you'll find the bad disk.

Sorry.

ENDUPDATE-8/21

For starters, did you read the man page for badblocks for the options you used?

   -w     Use write-mode test. With this option, badblocks scans for bad  blocks  by  writing
          some  patterns (0xaa, 0x55, 0xff, 0x00) on every block of the device, reading every
          block and comparing the contents.  This option may not  be  combined  with  the  -n
          option, as they are mutually exclusive.

So your data is gone; -n is the nondestructive version. Maybe what you really did was pull a disk from the array, run badblocks on it, and then reinsert it? Please clarify.

That you don't know which disk has failed to begin with tells me that it is not an MD RAID array. So whatever (largely non-existent) LVM "raid" tools exist to help you recover from this simple failure, that's what you need to figure out.

I would say that the majority of users go with an MD RAID solution. The remainder get distracted by "what's this thing?" or "oh, this is LVM, it's what I'm supposed to do, right?" and later end up where you are now: a RAID implementation with terrible management tools that actually creates more risk than the one you attempted to mitigate by building a RAID 6 to begin with.

It's not your fault, you didn't know. Frankly, they should disable the thing for exactly this reason.

Concerning repairing bad blocks: you can do this by taking the machine offline, booting to a live USB drive, and performing one of the following repair procedures.

https://sites.google.com/site/itmyshare/storage/storage-disk/bad-blocks-how-to

http://linuxtroops.blogspot.com/2013/07/how-to-find-bad-block-on-linux-harddisk.html
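Those procedures boil down to finding the failing LBA and forcing a rewrite of that sector so the drive remaps it, roughly like this (a sketch; 123456789 is a placeholder LBA, double-check it before writing anything):

    smartctl -l selftest /dev/sdX                                          # a long self-test reports LBA_of_first_error
    hdparm --read-sector 123456789 /dev/sdX                                # confirm the sector really fails
    hdparm --write-sector 123456789 --yes-i-know-what-i-am-doing /dev/sdX  # zero it, forcing reallocation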

As to where this sector is in your array: well, you would have to account for the parity rotation, which is a PITA. I would suggest that you simply verify each drive until you find the problem.
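Verifying each drive with a direct, cache-bypassing read is enough to show which member throws the error (the same direct dd suggested in the comments):

    for d in /dev/sd[a-f]; do
        echo "== $d =="
        dd if="$d" of=/dev/null bs=1M iflag=direct || echo "READ ERROR on $d"
    done
    dmesg | tail -n 50        # a real media error will also name the device here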

You can help prevent this in the future by enabling "disk scrubbing" in MD, which reads and rewrites each sector in a maintenance window to discover exactly these sorts of problems and potentially repair them.

I hope this helps.

ppetraki
  • 1. I removed each disk from the array before running badblocks and re-added it after badblocks finished, so no data gone ;) 2. It is an md array: md1 : active raid6 sdf5[6] sdd5[10] sdb5[5] sda5[8] sde5[7] 2985569280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU] 3. Maybe I didn't express myself clearly: I'm not trying to recover from the bad blocks, I'm trying to identify the disk that caused them. I already recovered from the error by copying the original file from the master server – Toni Aug 21 '17 at 07:19