
A housemate suggested that I ought to use btrfs instead of what I've been doing until now, which is mdadm with cloned drives, adding an extra drive into the array to "clone" a backup. The system has three drives, all physically different models:

  • /dev/sda: TOSHIBA HDWQ140
  • /dev/sdb: HGST HUS724040AL
  • /dev/sdc: WDC WDS250G2B0B

Well, I've installed btrfs, but now that it's been running for close to a year I find out that I should have had a weekly cron job "scrubbing" it all along. I started setting up a script for this, although it seems like a stupidly DIY system when you have to google for a script (the top hit I found dated from something like 2014) and install it yourself just to keep your filesystem running.
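
For what it's worth, the script doesn't need to be complicated. Here is a minimal sketch of the kind of weekly job I mean, assuming the filesystem is mounted at /mnt/data (a hypothetical path):

  #!/bin/sh
  # /etc/cron.weekly/btrfs-scrub -- weekly scrub of one btrfs filesystem.
  # -B keeps the scrub in the foreground so cron sees the exit status;
  # -d adds per-device statistics to the completion report.
  /usr/bin/btrfs scrub start -Bd /mnt/data \
      || echo "btrfs scrub on /mnt/data reported errors" | mail -s "btrfs scrub" root

(Mailing the failure notice assumes a working local MTA.)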

While I was doing all this admin work, I found some files that needed to be moved... I'll skip the gory details, but moving the files from one btrfs filesystem to another and back again generated all sorts of "input/output errors" (I never saw that with ext4), and even this gem:

Jan  4 21:19:19 host kernel: [9771285.171522] attempt to access beyond end of device
Jan  4 21:19:19 host kernel: [9771285.171522] sda1: rw=1, want=70370535518208, limit=7814035087
Jan  4 21:19:19 host kernel: [9771285.171529] BTRFS error (device sda1): bdev /dev/sda1 errs: wr 1, rd 0, flush 0, corrupt 5, gen 0
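
Those per-device counters in the last line (wr/rd/flush/corrupt/gen) can also be queried directly, which is useful for watching whether they keep climbing. A quick sketch, with the mount point again hypothetical:

  # Show the accumulated error counters for every device in the filesystem
  btrfs device stats /mnt/data
  # After investigating, zero the counters so any new errors stand out
  btrfs device stats --reset /mnt/data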

I'm assuming these are related. But here's the really stupid thing: I'm getting checksum errors not just on files that have been sitting around for a year, but on files I literally copied just hours ago to a different physical drive. Also, nearly all of them are on enormous files (things like DVD ISO images), if that's any indication of anything.

So yeah: am I really seeing a simultaneous triple drive failure, or does btrfs just go around corrupting my files for me?

Also, every post from the knowledgeable btrfs folks includes a cute little "well, you should restore that from backups... you do have backups, don't you?" So tell me, folks: what exactly do you use to back up a 4TB hard drive? I can't exactly, you know, write it out to a DVD, and if hard drives are this unreliable, then what good are backups to hard drives?

So serious questions:

  1. Are these checksum errors really normal and expected?
  2. Why am I seeing them on files that were only copied today?
  3. Will regular scrubs be enough to protect against this?
  4. Should I buy new hard drives and throw out all the ones currently in the machine because they really are failing?
  5. How do you recommend backing up multiple-terabyte data drives?

Update 2022-01-07: I ran smartctl on all of the drives (roughly the commands sketched after the list below) and they report no problems at all; raw UDMA_CRC_Error_Count is 0 for every drive. I tried to restore the corrupted files, but the tar archive I copied to the machine failed with an I/O error after only a few files. I really have no idea what's going on here:

  • If the drives or the cables were bad, this would show up in SMART, right?
  • If the CPU or the memory were bad, would the system really be running flawlessly? (Currently up 115 days with no obvious issues.)
  • If this were an across-the-board bug with btrfs, wouldn't it be all over the internet?
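
For anyone wanting to repeat those SMART checks, the commands were along these lines (the exact attribute names vary by drive vendor):

  # Dump all SMART data and pick out the interesting counters
  smartctl -a /dev/sda | grep -iE 'error|realloc|pending'
  # Start a long self-test (this takes hours on a 4TB drive)...
  smartctl -t long /dev/sda
  # ...and read the result once it has finished
  smartctl -l selftest /dev/sda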

So where could the problem actually be?

  • Well, there is a reason why all serious distributions that tried btrfs turned away from it. It's just not stable enough to use in production. – Gerald Schneider Jan 05 '22 at 05:46
  • And some preferred ZFS, while others didn't ;) – djdomi Jan 05 '22 at 07:38
  • Regarding: "How do you back up 4TB": 4TB is nothing. You get that in a regular consumer PC nowadays. You back that up to another 4TB disk (or more in a redundant RAID), or preferably a larger disk array that allows you to do incremental backups. It really depends on your threat model how you back up, if you only want to protect against hardware failure (a single second disk is enough) or against other data loss (crypto trojan, accidential deletes, etc.) – Gerald Schneider Jan 05 '22 at 08:03

1 Answer


I'm answering my own question because I think this is sort of interesting and might be of use to someone.

TL;DR The root cause of the reported problems appears to have been failing DRAM, not failing hard drives.

  1. No, these checksum errors are not normal or expected; another system running the same btrfs version was working perfectly well. They indicate that something is wrong, but not necessarily with the disks. See the next item.
  2. They showed up on newly copied data because of a major failure of the DRAM in the system, confirmed by MemTest86. Only one of the two sticks was bad, and it happened to be the stick mapped to higher memory, so the failures only bit when low memory was fully in use (rarely, but more often with larger files). This is also why they never affected the kernel.
  3. Regular scrubs might have detected the problem earlier. They don't help on a drive (e.g. /dev/sdc) that is not part of a mirror: a scrub can see a checksum error there, but it has no redundant copy from which to correct it. This is a fundamental limitation of btrfs, which (I believe) could have chosen a checksum function with a larger Hamming distance, but instead chose one that was faster to compute.
  4. I bought new hard drives, which can now serve as backups, but various SMART tests and other efforts suggest the current drives are probably OK. "All drives failing at once" was itself a good clue that the problem wasn't the hard drives.
  5. As noted, large drives have become cheap, and given that the drives themselves don't seem to be the failure point, backing up to hard drives still seems valid (one btrfs-native approach is sketched just after this list).
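
Since btrfs is staying in the picture, its snapshot-based replication is one candidate for using those backup drives. A minimal sketch of a full-plus-incremental cycle, with /mnt/data and /mnt/backup as hypothetical mount points (the source must be a subvolume):

  # One-time: take a read-only snapshot and copy it whole to the backup disk
  btrfs subvolume snapshot -r /mnt/data /mnt/data/snap-base
  btrfs send /mnt/data/snap-base | btrfs receive /mnt/backup

  # Each later run: snapshot again and send only the changes since snap-base
  btrfs subvolume snapshot -r /mnt/data /mnt/data/snap-new
  btrfs send -p /mnt/data/snap-base /mnt/data/snap-new | btrfs receive /mnt/backup
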
  • This is one of the reasons that ECC memory is *highly* recommended on btrfs systems (and ZFS systems). These data management systems do a great job of handling problems in secondary/tertiary storage, but primary storage errors will smash the greatest of arrays. – Spooler Jan 10 '22 at 16:38
  • You could also consider clustering to solve this issue, performing checksums across three independent systems to establish data-integrity consensus and prevent any one system's failure from destroying data. This is not cheaper than ECC RAM, but in some cases it can make sense to form a cluster rather than invest more in single nodes (if the needed consumer hardware is already there, for example). Neither of these filesystems can cluster across nodes on its own, so what I'm suggesting would have to be done with something like GlusterFS or DRBD (and what I'm suggesting is also not simple). – Spooler Jan 10 '22 at 16:41
  • Thanks @Spooler. I *thought* my motherboard (Gigabyte Aorus B450) supported ECC RAM, but in the fine print it says "in non-ECC mode". Just to be clear, this is a home-based server, so I don't exactly have a lot of space to set up a cluster of machines. – Greg Nelson Jan 11 '22 at 21:41