The input to my MapReduce program is a set of binary files. I want to be able to read them through mrjob. After some research it seems I have to write a custom hadoop streaming jar. Is there a simpler way? Or is such a jar readily available? Further details below.
The input files are just a sequence of 8-byte integers. I want my mapper function to be called with 2 integers at a time.
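To make the format concrete, here is a minimal sketch of how I'd parse these files in plain Python (assuming big-endian, signed 64-bit integers; the actual byte order may differ):

```python
import struct

def iter_pairs(data):
    """Yield (a, b) tuples from raw bytes holding 8-byte integers.

    '>qq' = two big-endian signed 64-bit integers, i.e. 16 bytes per pair.
    """
    for a, b in struct.iter_unpack(">qq", data):
        yield a, b
```

This is what I'd want each mapper call to receive: one `(a, b)` pair per record.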
I first thought that I could convert the input into pickle's binary format and then specify:
INPUT_PROTOCOL = mrjob.protocol.PickleProtocol.
But that gives an error: "Unable to decode input". I also suspect that mrjob only works with the ASCII pickle format, not the binary one, because otherwise how would hadoop streaming deal with bytes that happen to look like a newline? The mrjob source code seems to confirm this.
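The newline collision is easy to demonstrate: any 8-byte integer whose low byte is 0x0A embeds a literal `\n` in the stream, and binary pickles have the same problem, so a line-oriented reader would split records at the wrong places. A quick check:

```python
import pickle
import struct

# The 8-byte big-endian encoding of 10 ends in byte 0x0A, i.e. '\n'
assert b"\n" in struct.pack(">q", 10)

# A binary pickle of the integer 10 also embeds the raw byte 0x0A
assert b"\n" in pickle.dumps(10)

# And binary pickles of bytes objects store their payload verbatim
assert b"\n" in pickle.dumps(b"payload with \n inside")
```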
The other option is to write a custom hadoop streaming jar; mrjob has an option to specify such a jar. But as someone unfamiliar with hadoop/Java, I would prefer a Python-based solution.