I have two relatively new 4 TB hard drives (WD Data Center Re WD4000FYYZ) formatted as btrfs with raid1 data and raid1 metadata.

I copied a large binary file to the volume (~76 GB). Soon after copying the file, I ran a btrfs scrub. There were no errors.

A few months later, a scrub returned an unrecoverable error on that file. It has not been modified since it was originally copied. I might add that the SMART attributes for both drives do not indicate any errors (Current_Pending_Sector or otherwise).

The system with the drives does not have ECC memory.

The only explanation I can think of is this: while another file was being written whose data checksums shared a block with some of the big file's checksums, a memory error corrupted that block, and bad data ended up in one or more of the big file's checksums.

In migrating to btrfs, I had hoped that once data was written and scrubbed successfully, I could be confident it would stay intact as long as it was not written to again (in a raid1/5/6 configuration, of course). Evidently, this is not the case.

Can anyone explain how this could have happened? Also, if I had taken a snapshot of the volume that contained the big file, would I still have had access to the original, uncorrupted data from the snapshot?

R. Lochner
  • Have you run memtest? Maybe badblocks? Was the filename mentioned in dmesg? Is this in a vm, by any chance? Are other files/inodes corrupted as well? Did anything special happen right before it was corrupted, was the system under high load or something? – basic6 Aug 08 '16 at 07:57
  • I had extensive discussions on the btrfs mailing list after I posted this. I did, in fact, have a bad memory chip. Occasionally, one or more bits would flip, corrupting a checksum block. The data itself was good, but the mirrored checksums were bad due to a silent memory error. I have replaced the RAM and the problem has not reappeared. – R. Lochner Aug 09 '16 at 13:14
  • Well, that explains it. Bad memory can cause all kinds of damage. This wouldn't happen because of btrfs. In fact, btrfs helped you find the memory issue and it also told you which files have been corrupted. I suggest you post this as an answer to your question. – basic6 Aug 10 '16 at 10:40

1 Answer

This silent data corruption was caused by a bad memory stick. The memory was replaced and the problem has not reappeared.
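
To illustrate why raid1 could not repair this, here is a rough Python sketch of the failure mode, not btrfs's actual code. It assumes a single 4096-byte block, uses `zlib.crc32` as a stand-in for the checksum btrfs really uses, and the names `checksum`, `flip_bit`, and `scrub` are made up for the example. The point is that if a checksum is damaged in RAM before it is written, both mirrors receive the same bad checksum, so a scrub finds no good copy to repair from even though the data is intact.

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in checksum for illustration only (assumption: not btrfs's real one).
    return zlib.crc32(block)


def flip_bit(value: int, bit: int) -> int:
    # Simulate a single-bit memory error in a stored checksum.
    return value ^ (1 << bit)


def scrub(mirrors):
    # Recompute each copy's checksum and compare it with the stored one.
    return [checksum(m["data"]) == m["csum"] for m in mirrors]


data = b"\x00" * 4096          # one intact data block
good_csum = checksum(data)

# The checksum is corrupted in memory *before* it reaches the disks,
# so both raid1 copies are written with the same wrong value.
bad_csum = flip_bit(good_csum, 3)
mirrors = [
    {"data": data, "csum": bad_csum},
    {"data": data, "csum": bad_csum},
]

results = scrub(mirrors)
print(results)  # [False, False]
```

Both copies fail verification, which matches what I saw: an "unrecoverable" error on a file whose data was actually fine.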

R. Lochner