0

The Problem:

I'm in charge of a Hadoop cluster of 44 nodes. We have 1.5TB WD Green drives with the (not very well known) Load Cycle Count problem.

These disks work fine, but as they get older they show an increasing number of bad blocks. Rewriting these bad blocks works for some time, but they reappear in different places.

As most of these disks are only used for Hadoop datanodes and we don't have the budget to replace them all, I'm looking for a strategy to

  1. Not go insane maintaining the cluster; disk errors and related filesystem problems appear almost daily. My current procedure (sketched in commands below this list) is:

    • stop the Hadoop services, unmount the disk, locate the bad blocks using the dmesg output and smartctl, and rewrite these bad blocks with hdparm --write-sector.
    • run fsck -f -y on the disk and remount it.
  2. Keep the system stable.

    • Hadoop takes care of disk errors (3x redundancy), but I'd rather not risk corrupted filesystems.
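
For reference, one round of this looks roughly like the following. The device, mount point and sector number are just stand-ins (/dev/sdb1 on /data/1, LBA 123456789 taken from the dmesg/smartctl output):

    # stop the Hadoop daemons on the node (datanode / tasktracker), then take the disk offline
    umount /data/1

    # find the failing sector in the kernel log and confirm it
    dmesg | grep -i sector
    smartctl -a /dev/sdb | grep -i pending
    hdparm --read-sector 123456789 /dev/sdb

    # force-rewrite the sector so the drive reallocates it (destroys that sector's contents)
    hdparm --yes-i-know-what-i-am-doing --write-sector 123456789 /dev/sdb

    # check the filesystem and put the disk back into service
    fsck -f -y /dev/sdb1
    mount /data/1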

What did I do?

At the moment I've changed the mount options to:

  • errors=continue,noatime, but I still get the occasional read-only remount because of journaling errors.
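
In /etc/fstab that amounts to a line along these lines (same stand-in device and mount point as above; I'm writing ext4 here, adjust for ext3):

    /dev/sdb1  /data/1  ext4  defaults,noatime,errors=continue  0  0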

Then I've tried disabling the journal:

  • tune2fs -O ^has_journal: this avoids read-only remounts, but seems to corrupt the filesystem (which makes sense: no journal)
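
That was done with the disk unmounted, roughly like this (stand-in device again); the journal can be put back later with tune2fs -O has_journal:

    umount /data/1
    tune2fs -O ^has_journal /dev/sdb1    # drop the journal
    e2fsck -f /dev/sdb1                  # make sure the filesystem is clean before remounting
    mount /data/1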

Now I'm thinking about switching to

  • tune2fs -o journal_data_writeback and mount with data=writeback,nobh,barrier=0

But I'm not sure if this re-introduces the read-only remounts.
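
Per disk, that change would look roughly like this (same stand-in device and mount point as above):

    umount /data/1
    tune2fs -o journal_data_writeback /dev/sdb1    # make writeback the default journaling mode
    # matching /etc/fstab entry:
    # /dev/sdb1  /data/1  ext4  defaults,noatime,errors=continue,data=writeback,nobh,barrier=0  0  0
    mount /data/1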

So: I'd like to avoid read-only remounts and keep the filesystem metadata stable, but I don't care about errors in the data itself (Hadoop takes care of this). Speed should also not be impacted.

What choices do I have? I'm aware that this is probably a nightmare story for any sysadmin. The OS partitions are mounted with full journaling, and I'm not going to experiment on production data. This is strictly for the Hadoop datanode / tasktracker hard disks.

kei1aeh5quahQu4U
  • 445
  • 5
  • 22

2 Answers

6

The best thing you can do is get the disks replaced. The cost of the disks won't weigh up against the cost of the cluster being down and the amount of work time you're putting in to fix the bad blocks. So even without a budget, I would seriously try to convince your management.

Lucas Kauffman
  • 16,880
  • 9
  • 58
  • 93
  • 1
    Also, as blocks go bad they take longer and longer to read from, as it may take several attempts to get a read whose checksum lines up. Only when the device attempts to read the block more times than its configured limit is it marked as "bad". If the OP does not replace the disks, the performance of his storage is going to nosedive horrifically before it dies completely. – Sammitch May 15 '13 at 17:10
1

If you ABSOLUTELY need to use these drives, I'd recommend making the filesystems with mkfs -c -c… to have mkfs check for bad blocks.
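
For ext4 that would be something along these lines (stand-in device; this recreates the filesystem, so only do it on a disk you've already taken out of the cluster and wiped):

    # '-c -c' runs the slow read-write bad block test before creating the filesystem
    mkfs.ext4 -c -c /dev/sdb1

    # or scan separately with badblocks and hand the resulting list to mke2fs
    badblocks -wsv -o /tmp/sdb1.bad /dev/sdb1
    mkfs.ext4 -l /tmp/sdb1.bad /dev/sdb1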

You could try another filesystem like btrfs and see if that works better, but ultimately the correct answer is 'replace the disks'.

MikeyB
  • 39,291
  • 10
  • 105
  • 189