I recently ran into what appear to be cases of disk corruption, and I would like to understand them better.

I have a build server that I work with daily. A full build of a recent LLVM release stopped with a strange error message, and one of the generated files (X86GenDisassemblerTables.inc) contained this excerpt:

...
/* 0xa5 */
{ /* ModRMDecision */
 MODRM_ONEENTRY,
 0 /* EmptyTable */
},
/* 0xa6 */
{ /* ModRMDecision */
 MODÒM_ONEENTRY,                # Ò = 0xD2
 0 /* EmptyTable */             # R = 0x52
},
/* 0xa7 */
{ /* ModRMDecision */
 MODRM_ONEENTRY,
 0 /* EmptyTable */
},
...

This seems to be a single-bit file corruption. After I removed the file, the build regenerated it and completed successfully.
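The single-bit claim can be checked from the two byte values annotated in the excerpt (0x52 for `R`, 0xD2 for `Ò`). The following is only a minimal sketch; the helper name `flipped_bits` is made up for illustration and is not part of any tool mentioned here.

```python
def flipped_bits(a: int, b: int) -> list[int]:
    """Return the bit positions (0 = least significant) where bytes a and b differ."""
    diff = a ^ b  # XOR leaves a 1 only where the two bytes disagree
    return [i for i in range(8) if diff & (1 << i)]


expected = ord("R")  # 0x52, the byte the generator should have written
observed = 0xD2      # the byte actually found in the corrupted file
positions = flipped_bits(expected, observed)

# Show both bytes in binary and the positions where they differ.
print(f"{expected:08b} -> {observed:08b}, differing bits: {positions}")
```

The two values differ only in bit 7 (the most significant bit), which is consistent with a single flipped bit rather than a larger overwritten region.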

And today, on a different machine, this .d file was produced during a build:

output-gcc-8.2.0-x86_64-linux-gnu/obj/headers.hpp.gch: src/headers.hpp
pp      # What's this?

Everything else -- file size, permissions, even the terminating newline -- was in place. Removing the file also allowed the build to generate it again without problems.

Are these legitimate cases of disk corruption? Which tools can I use to diagnose this? The disks are SSDs, one and two years old respectively, running ext4 file systems.

alecov

1 Answer

You might want to start with a RAM test. Hard drives typically know when they have a read or write failure. If you're not already seeing hard drive errors in the kernel messages and you're not using ECC RAM, I would suspect the RAM over the hard drive.

longneck
  • Thanks for your answer. I also suspected the RAM. Although the server doesn't have ECC chips, it went through a three-day memory test a couple of years ago before deployment, so I ruled out RAM corruption in the short term (since memory chips have quite long lifespans). I will definitely run a battery of memory tests on the other machine I mentioned in the question. – alecov Sep 06 '18 at 19:06