A housemate suggested that I use btrfs instead of what I'd been doing up until now, which was running mdadm with cloned drives and adding an extra drive to the array whenever I wanted to "clone" a backup. The system has three drives, all physically different models:
- /dev/sda: TOSHIBA HDWQ140
- /dev/sdb: HGST HUS724040AL
- /dev/sdc: WDC WDS250G2B0B
Well, I installed btrfs, and now that it's been running for close to a year I find out that I should have had a weekly cron job "scrubbing" it. I started trying to set up a script for this, although it seems like a stupidly DIY system that requires you to google a script (the top hit I found was from something like 2014) and install it yourself just to keep your filesystem running.
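For reference, here is a minimal sketch of what such a weekly job could look like in Python. This is not the script I found. It assumes the `btrfs` tool is on the PATH and that the job runs as root; the function names and the `/mnt/data` mount point in the usage note are made up for illustration.

```python
import subprocess


def parse_device_stats(text):
    """Parse `btrfs device stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        parts = line.strip().rsplit(None, 1)
        # Expected line shape: [/dev/sda1].corruption_errs  5
        if len(parts) != 2 or not parts[0].startswith("[") or "]." not in parts[0]:
            continue
        device, counter = parts[0][1:].split("].", 1)
        stats.setdefault(device, {})[counter] = int(parts[1])
    return stats


def nonzero_counters(stats):
    """Keep only the devices and counters that report at least one error."""
    return {
        device: {name: value for name, value in counters.items() if value}
        for device, counters in stats.items()
        if any(counters.values())
    }


def scrub_and_report(mountpoint):
    """Run a foreground scrub, then return (scrub exit code, nonzero error counters)."""
    # -B keeps the scrub in the foreground, so the call returns when it is done.
    scrub = subprocess.run(
        ["btrfs", "scrub", "start", "-B", mountpoint],
        capture_output=True, text=True,
    )
    stats = subprocess.run(
        ["btrfs", "device", "stats", mountpoint],
        capture_output=True, text=True, check=True,
    )
    return scrub.returncode, nonzero_counters(parse_device_stats(stats.stdout))
```

Calling `scrub_and_report("/mnt/data")` from a weekly cron entry and mailing the result whenever the exit code is nonzero or the returned dict is non-empty would cover the basics.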
While I was doing all this admin stuff, I found some files that needed to be moved... I'll skip the gory details, but moving the files from one btrfs filesystem to another and back again generated all sorts of "input/output errors" (never seen that with ext4), and even this gem:
Jan 4 21:19:19 host kernel: [9771285.171522] attempt to access beyond end of device
Jan 4 21:19:19 host kernel: [9771285.171522] sda1: rw=1, want=70370535518208, limit=7814035087
Jan 4 21:19:19 host kernel: [9771285.171529] BTRFS error (device sda1): bdev /dev/sda1 errs: wr 1, rd 0, flush 0, corrupt 5, gen 0
I'm assuming these are related. But here's the real stupid thing. I'm getting checksum errors not just on files that have been sitting around for a year, but on files that I literally copied just hours ago to a different physical drive. Also, nearly all of them are on enormous files (things like DVD iso images) if that is any indication of anything?
So yeah, am I seeing a simultaneous triple drive failure, or does btrfs just go around corrupting my files for me?
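One way to check whether a copy is going wrong independently of btrfs's own checksums would be to hash the source and the copy and compare the results. A minimal sketch in Python, where the function names are mine and any paths you pass in are up to you:

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source, copy):
    """True if both files have identical contents according to SHA-256."""
    return sha256_of(source) == sha256_of(copy)
```

One caveat: right after a copy, the second read may be served from the page cache rather than from the disk, so a match immediately after copying says less than a match after a reboot or after the cache has been dropped.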
Also, every post from the knowledgeable btrfs folks includes a cute little "well, you should restore that from backups... you do have backups, don't you". So tell me folks, what exactly do you use to back up a 4TB hard drive? Because I can't exactly, you know, write it out to a DVD, and if hard drives are this unreliable then what good are backups to hard drives?
So serious questions:
- Are these checksum errors really normal and expected?
- Why am I seeing them on files that were only copied today?
- Will regular scrubs be enough to protect against this?
- Should I buy new hard drives and throw out all the ones currently in the machine because they really are failing?
- How do you recommend backing up multiple-terabyte data drives?
Update 2022-01-07: I ran smartctl on all of the drives and they report no problems at all. Raw UDMA_CRC_Error_Count is 0 for all drives. I tried to restore the corrupted files... the tar file I copied to the machine failed after a few files with an I/O error. Really no idea what's going on here:
- If the drives or the cables were bad, this would show up in SMART, right?
- If the CPU or the memory were bad, the system wouldn't be running flawlessly, would it? (Currently up 115 days with no obvious issues.)
- If this were an across-the-board bug with btrfs, wouldn't it be all over the internet?
So where could the problem actually be?