
I'm using Amazon Elastic MapReduce to process some log files uploaded to S3.

The log files are uploaded daily from servers to S3, but it seems that some get corrupted during the transfer. This results in a java.io.IOException: IO error in map input file exception.

Is there any way to have Hadoop skip over the bad files?

Adrian Mester

2 Answers


There's a whole bunch of record-skipping configuration properties you can use to do this - see the mapred.skip. prefixed properties on http://hadoop.apache.org/docs/r1.2.1/mapred-default.html
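For reference, here's a minimal sketch (not from the answer) of turning skip mode on through the old mapred API. The SkipBadRecords helpers just set the mapred.skip.* properties listed on the page above; the exact values and the class name are illustrative, and the same properties can also be passed as -D options to a streaming job.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipModeExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SkipModeExample.class);

        // Skip mode only kicks in after a task has failed this many times
        // (sets mapred.skip.attempts.to.start.skipping).
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);

        // Allow records around a bad one to be thrown away; 0 disables
        // skipping, Long.MAX_VALUE skips as much as needed
        // (sets mapred.skip.map.max.skip.records).
        SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);

        // ... configure mapper/reducer and submit the job as usual ...
    }
}
```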

There's also a nice blog post about this subject and these config properties.

That said, if your file is completely corrupt (i.e. broken before the first record), you might still have issues even with these properties.

Chris White
  • I saw that, but from what I understand that only applies to corrupt rows (which I can handle myself with a try/except in Python) – Adrian Mester Nov 13 '13 at 13:26
  • Well, in that case you'll probably need to write your own InputFormat and RecordReader classes which can handle corrupt files appropriately – Chris White Nov 14 '13 at 01:03

Chris White's suggestion to write your own RecordReader and InputFormat is exactly right. I recently faced this issue and solved it by catching the file exceptions in those classes, logging them, and moving on to the next file.

I've written up some details (including full Java source code) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
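For a rough idea of what that looks like (an illustrative sketch, not the code from the linked post; class names are made up), a RecordReader can delegate to LineRecordReader and treat an IOException as end-of-input for that split:

```java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical InputFormat whose reader logs IO errors from a corrupt file
// and reports end-of-input so the job moves on to the next split.
public class SkipCorruptFilesInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new SkippingLineRecordReader();
    }

    /** Delegates to LineRecordReader but treats IO errors as end-of-split. */
    public static class SkippingLineRecordReader
            extends RecordReader<LongWritable, Text> {

        private static final Log LOG =
                LogFactory.getLog(SkippingLineRecordReader.class);

        private final LineRecordReader delegate = new LineRecordReader();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            try {
                delegate.initialize(split, context);
            } catch (IOException e) {
                // File may be broken before the first record; log and carry on.
                LOG.warn("Could not open split " + split + ", skipping it", e);
            }
        }

        @Override
        public boolean nextKeyValue() {
            try {
                return delegate.nextKeyValue();
            } catch (IOException e) {
                // Corrupt data mid-file: log it and pretend the split is done.
                LOG.warn("IO error reading split, skipping rest of file", e);
                return false;
            }
        }

        @Override
        public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

        @Override
        public Text getCurrentValue() { return delegate.getCurrentValue(); }

        @Override
        public float getProgress() throws IOException { return delegate.getProgress(); }

        @Override
        public void close() throws IOException { delegate.close(); }
    }
}
```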

John Chrysostom