
I'm using Amazon Elastic MapReduce to process some log files uploaded to S3.

The log files are uploaded daily from servers to S3, but it seems that some get corrupted during the transfer. This results in a java.io.IOException: IO error in map input file exception.

Is there any way to have Hadoop skip over the bad files?

Adrian Mester

2 Answers


There's a whole bunch of record-skipping configuration properties you can use to do this - see the mapred.skip. prefixed properties on http://hadoop.apache.org/docs/r1.2.1/mapred-default.html
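For reference, here's a minimal sketch (not from the answer) of turning skip mode on through the old mapred API. The SkipBadRecords helpers just set the mapred.skip.* properties listed on the page above; the exact values and the class name are illustrative, and the same properties can also be passed as -D options to a streaming job.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipModeExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SkipModeExample.class);

        // Skip mode only kicks in after a task has failed this many times
        // (sets mapred.skip.attempts.to.start.skipping).
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);

        // Allow records around a bad one to be thrown away; 0 disables
        // skipping, Long.MAX_VALUE skips as much as needed
        // (sets mapred.skip.map.max.skip.records).
        SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);

        // ... configure mapper/reducer and submit the job as usual ...
    }
}
```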

There's also a nice blog post about this subject and these config properties.

That said, if your file is completely corrupt (i.e. broken before the first record), you might still have issues even with these properties.

Chris White
  • I saw that, but from what I understand that only applies to corrupt rows (which I can handle myself with a try/except in Python) – Adrian Mester Nov 13 '13 at 13:26
  • Well, in that case you'll probably need to write your own InputFormat and RecordReader classes which can handle corrupt files appropriately – Chris White Nov 14 '13 at 01:03

Chris White's suggestion to write your own RecordReader and InputFormat is exactly right. I recently faced this issue and solved it by catching the file exceptions in those classes, logging them, and moving on to the next file.

I've written up some details (including full Java source code) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
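For a rough idea of what that looks like (an illustrative sketch, not the code from the linked post; class names are made up), a RecordReader can delegate to LineRecordReader and treat an IOException as end-of-input for that split:

```java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical InputFormat whose reader logs IO errors from a corrupt file
// and reports end-of-input so the job moves on to the next split.
public class SkipCorruptFilesInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new SkippingLineRecordReader();
    }

    /** Delegates to LineRecordReader but treats IO errors as end-of-split. */
    public static class SkippingLineRecordReader
            extends RecordReader<LongWritable, Text> {

        private static final Log LOG =
                LogFactory.getLog(SkippingLineRecordReader.class);

        private final LineRecordReader delegate = new LineRecordReader();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            try {
                delegate.initialize(split, context);
            } catch (IOException e) {
                // File may be broken before the first record; log and carry on.
                LOG.warn("Could not open split " + split + ", skipping it", e);
            }
        }

        @Override
        public boolean nextKeyValue() {
            try {
                return delegate.nextKeyValue();
            } catch (IOException e) {
                // Corrupt data mid-file: log it and pretend the split is done.
                LOG.warn("IO error reading split, skipping rest of file", e);
                return false;
            }
        }

        @Override
        public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

        @Override
        public Text getCurrentValue() { return delegate.getCurrentValue(); }

        @Override
        public float getProgress() throws IOException { return delegate.getProgress(); }

        @Override
        public void close() throws IOException { delegate.close(); }
    }
}
```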

John Chrysostom