I recently lost a drive in my RAID array (and got an email from the system warning me of this, which was awfully nice) and after some drive shuffling and swapping in a new drive I'm all safe and secure. But along the way, I found this thread, which got me to thinking about how you could actually test for disk errors and other Bad Things without them actually occurring. When I ran the suggested tar command:
tar c /my/raid/device/mount/point > /dev/null
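(For comparison - a variant I've seen suggested that forces the archive bytes through an actual pipe, in case tar is short-circuiting writes that go straight to /dev/null. The `MOUNT` variable is just a placeholder for illustration; point it at the filesystem you want to exercise.)

```shell
# Read test: produce the archive into a pipe so the file contents must
# actually be read, then discard everything on the far side of the pipe.
# MOUNT is a placeholder; defaults to the current directory for testing.
MOUNT="${MOUNT:-.}"
tar cf - "$MOUNT" | cat > /dev/null && echo "read test completed"
```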
it completed in a few seconds, which is clearly not long enough for the system to have actually read all the files (well over a TiB) - so I guess my first question is: why didn't that work? If I do something like this:
find . -type f | xargs md5sum
That command runs just fine, and it takes a long time to complete... but it also loads up the CPU doing all the summing. It may or may not be faster or easier than tar - either way, I'm more curious why the tar command didn't work as I'd expected.
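(If the goal is just to read every byte rather than to keep checksums, a cheaper variant is to cat everything to /dev/null - the disks still do all the work, but the CPU stays mostly idle. Using the null-terminated find/xargs options also keeps filenames with spaces from breaking the pipeline, which the plain `find . -type f | xargs md5sum` form would.)

```shell
# Read every regular file under the current directory and discard the
# data; -print0 / -0 keep odd filenames (spaces, newlines) intact.
find . -type f -print0 | xargs -0 cat > /dev/null && echo "all files read"
```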
Anyway - second question, and a more general one: is there a way to do fault-injection testing along these lines:
- find (or create) a file that I don't care about...
- determine which block on the disk is used to store that particular file...
- fool the software/OS into thinking this block is "bad" (I assume by marking it somehow - this is where my knowledge runs out)
- run my test scripts and/or error checking routines
- confirm that the array both reports the error and takes whatever other corrective action is necessary...
- mark that block/sector as "good" again so the system/OS uses it as normal.
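(The closest standard tool I'm aware of for this cycle works at the whole-member level rather than the single-block level: mdadm's manage mode can mark an array member as failed, let you watch the array react, then remove and re-add it to trigger a rebuild. A sketch, assuming an md array - `/dev/md0` and `/dev/sdc1` below are placeholders, and the function deliberately does nothing unless it's run as root against a real array. For per-block faults, the device-mapper targets dm-flakey and dm-error can stack a deliberately misbehaving mapping on top of a real device, though that takes more setup.)

```shell
# Sketch: simulate a drive failure in an md RAID array with mdadm's
# fail/remove/add cycle. Arguments are placeholders; guarded so it is
# a harmless dry run without root or without the array present.
simulate_md_failure() {
    md="${1:-/dev/md0}"        # placeholder: your array device
    member="${2:-/dev/sdc1}"   # placeholder: a member you can afford to kick out
    if [ "$(id -u)" -ne 0 ] || [ ! -b "$md" ]; then
        echo "no md array here (or not root): dry run only"
        return 0
    fi
    mdadm "$md" --fail "$member"     # mark the member faulty; mdadm --monitor should email
    cat /proc/mdstat                 # the member now shows up flagged (F)
    # ... run your test scripts / error-checking routines here ...
    mdadm "$md" --remove "$member"   # pull the failed member out of the array
    mdadm "$md" --add "$member"      # re-add it; the array rebuilds onto it
}

simulate_md_failure
```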
This seems like something that should be doable, but I don't have enough detailed knowledge of the Linux tools that would let me mark a block as bad at the device level without it actually BEING a bad block...
Any thoughts on this? Or, if there's a much more elegant way to solve it, I'm happy to hear that as well...