
In our application we use Hibernate and PostgreSQL to store data. In one of our database tables we have a discriminator column which says for example "TIPPSPIEL". It is a fixed string and can not be manipulated by any user.

Suddenly we had one entry in this huge table where we had "TIPPQPIEL" instead of "TIPPSPIEL". We have no clue how this can happen.

Is it possible by any means that our hard disk flipped one bit, so that our letter "S" is no longer encoded as "1010011" but suddenly becomes a "Q" on the hard disk, with one bit flipped like this: 1010001?

I am not an expert on hard disks and bit physics, but I guess an OS or a disk has checksums and other mechanisms to ensure that this can't happen.

Is it possible that just one bit switches so my file shows me a letter "Q" instead of a "S"?

UPDATE: We did some further analysis. Our slave database gets its WAL records from the master (a PostgreSQL feature), so the slave server should be in sync. But the slave wasn't in sync regarding this particular row. We could see that the change happened a few days ago without any interaction from a user on this particular entry. So it MUST be a flipped bit. Scary!

rszalski
Janning
  • I'd rather assume this came from a faulty memory. Do you still have the log, when that column was written? – ott-- Jul 12 '13 at 09:19
  • It's unlikely but possible; bits in transit get flipped with a high degree of regularity, see 'bitsquatting' – Sirch Jul 12 '13 at 09:39

1 Answer


It's so rare we see a genuinely interesting question on this site, so thank you first of all.

I think what you're seeing there is indeed a single-bit error. It's amazing you could spot it, to be honest, but you're correct in assuming that the second-least-significant bit has been switched (assuming you're using ASCII, anyway).
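
To make the bit arithmetic concrete, here is a minimal Python sketch that compares the two letters. It assumes plain 7-bit ASCII, and the helper name `flipped_bits` is just something made up for this illustration:

```python
def flipped_bits(a: str, b: str) -> list:
    """Return the bit positions (0 = least significant) where two characters differ."""
    diff = ord(a) ^ ord(b)  # XOR leaves a 1 only where the two codes differ
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


print(format(ord("S"), "07b"))  # 1010011
print(format(ord("Q"), "07b"))  # 1010001
print(flipped_bits("S", "Q"))   # [1]
```

Position 1 is the second-least-significant bit, so "S" and "Q" are exactly one flipped bit apart.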

As for checksums etc.: when the data was written to the disk it will likely have been verified as fine. I'm pretty sure this problem developed afterwards via a simple magnetic-leakage error. But you're right, there are encoding checks done. It varies between manufacturers, but there's probably an error logged somewhere saying 'this looks a bit odd'. But what option does your IO chain have available? Deny you the whole block? I'm going to assume this is a single non-RAIDed disk, as RAIDed disks tend to have more options available to them when they detect errors.
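
As a rough illustration of how a checksum stored alongside the data would catch a flip like this, here is a small Python sketch using CRC32 from the standard library. This is not how a drive's internal ECC works (drives use stronger error-correcting codes on the raw sector); the `flip_bit` helper and the idea of keeping a CRC next to the value are assumptions made purely for the example:

```python
import zlib


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (bit 0 = least significant)."""
    buf = bytearray(data)
    buf[byte_index] ^= 1 << bit
    return bytes(buf)


original = b"TIPPSPIEL"
stored_crc = zlib.crc32(original)     # checksum recorded at write time

corrupted = flip_bit(original, 4, 1)  # byte 4 is 'S'; flipping bit 1 gives 'Q'
print(corrupted)                      # b'TIPPQPIEL'

# On read, recompute the checksum and compare it with the stored one.
print(zlib.crc32(corrupted) == stored_crc)  # False
```

A CRC detects every single-bit error, so the comparison fails here. Detection is all it gives you, though: without a second copy (a RAID mirror, a replica, a backup) there is nothing to repair the value from.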

It's an odd one, though this kind of thing probably happens multiple times a second across the world.

Chopper3
  • You are right, it was a non-RAID disk setup in this case. As my further analysis shows, it happened long after the record was written. – Janning Jul 12 '13 at 13:22
  • In my 20 years as a sysadmin I have seen 3 cases of a single bit flip. Only one of those could be proven 100%. The other 2 were suspected to be flipped bits, but we couldn't tell for certain. (The bit could have flipped in memory after reading the file. By the time we noticed the discrepancy the original file was not available anymore or had been touched.) I'm quite sure it happens more often than everyone thinks, but it is rarely noticed and usually not provable if it is noticed. – Tonny Jul 12 '13 at 13:41
  • Failing the whole block read is exactly what drives do when they get an uncorrectable error. It is impossible to have only a single bit flip in the user data part of the sector and have it go undetected. The bit must have been flipped when it was written to the disk. – psusi Jul 12 '13 at 14:34
  • Should this question be made canonical? – Deer Hunter Jul 14 '13 at 09:16
  • @psusi Not impossible, as you just need enough bit-flips in the sector to make the ECC come out right. Unlikely, but possible, and disk manufacturers quote high enough error rates that you really ought to expect to see some. I've heard rumors that ZFS folks see them (due to ZFS-level data checksums)... – derobert Jul 15 '13 at 19:58
  • @derobert, hence why I qualified my statement with *only* a single bit flip. It is theoretically possible (though you have a better chance to win the lottery and get struck by lightning on the same day) to have the raw bits scrambled in such a way as to flip the target bit after the Reed-Solomon error correction, *and* then satisfy the ECC, but it involves flipping *many* bits, not just the one. This would leave the rest of the contents of that sector unrecognizable. – psusi Jul 15 '13 at 20:22