
I have a directory full of MapFiles and now want to run an MR job on them. I use the SequenceFileInputFormat of the new API, which, as one answer in this thread states, should be aware of MapFiles. However, this does not work: the job runs up to a certain percentage and then fails with

Error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to com.mycompany.MyOwnWritable

I suppose the mapper trips over the index files. How can I make sure these are ignored, or better, that only files with the correct input key and value classes are used? The only approach that comes to mind is a Mapper<Object, Object, MyKeyOut, MyValueOut> with instanceof checks (see the sketch below), but I consider this ugly. Is there a better way to do this?
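For illustration, here is a minimal sketch of that instanceof-based workaround. The type names MyKey, MyOwnWritable, MyKeyOut and MyValueOut are placeholders for the real classes, and the skipped records are assumed to be the (key, LongWritable position) entries of a MapFile index:

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the "ugly" workaround: accept Object keys/values and filter by type.
public class TolerantMapper extends Mapper<Object, Object, MyKeyOut, MyValueOut> {
    @Override
    protected void map(Object key, Object value, Context context)
            throws IOException, InterruptedException {
        // Skip records that do not carry the expected types, e.g. the
        // LongWritable positions coming from a MapFile index file.
        if (!(key instanceof MyKey) || !(value instanceof MyOwnWritable)) {
            return;
        }
        MyKey k = (MyKey) key;
        MyOwnWritable v = (MyOwnWritable) value;
        // ... actual processing, emitting MyKeyOut/MyValueOut pairs ...
    }
}
```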


1 Answer


I found out where it trips over the index files. When recursive traversal of the input paths is enabled (by setting mapreduce.input.fileinputformat.input.dir.recursive to true), the files for the map tasks are gathered by walking down the entire file and directory tree. The SequenceFileInputFormat then receives the individual data and index files instead of the directories containing them, so the MapFile detection fails. It only works if the input format receives the directory containing the two files composing the MapFile.

The job runs without failing when recursion is turned off and the layout MR expects is ensured, i.e., a directory where all MapFiles to be processed are stored "flat" without an additional folder structure, or when every directory containing a MapFile is added manually by calling FileInputFormat.addInputPath for each such directory, as in the sketch below.
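A minimal sketch of that job setup, assuming the MapFile directories all sit directly under a root path passed as the first argument (the mapper class and paths are placeholders, and the recursive-listing property is deliberately left unset):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapFileJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Do NOT set mapreduce.input.fileinputformat.input.dir.recursive to true;
        // recursive listing hands the individual data/index files to the input
        // format and breaks the MapFile detection.
        Job job = Job.getInstance(conf, "mapfile-job");
        job.setJarByClass(MapFileJobDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(MyMapper.class); // placeholder mapper class

        // Add each MapFile directory under the root path as its own input path,
        // so the input format sees the directory, not its data/index files.
        Path root = new Path(args[0]);
        FileSystem fs = root.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(root)) {
            if (status.isDirectory()) {
                FileInputFormat.addInputPath(job, status.getPath());
            }
        }

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```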

EDIT: Reported as a bug: MAPREDUCE-6155
