The input to my MapReduce program is a set of binary files. I want to be able to read them through mrjob. After some research it seems I have to write a custom hadoop streaming jar. Is there a simpler way? Or is such a jar readily available? Further details below.
The input files are just a sequence of 8-byte integers. I want my mapper function to be called with 2 integers at a time.
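To make the format concrete, here is a minimal sketch of how I'd parse these files in plain Python (assuming big-endian, signed 64-bit integers; the actual byte order may differ):

```python
import struct

def iter_pairs(data):
    """Yield (a, b) tuples from raw bytes holding 8-byte integers.

    '>qq' = two big-endian signed 64-bit integers, i.e. 16 bytes per pair.
    """
    for a, b in struct.iter_unpack(">qq", data):
        yield a, b
```

This is what I'd want each mapper call to receive: one `(a, b)` pair per record.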
I first thought that I could convert the input into pickle's binary format and then specify:
INPUT_PROTOCOL = mrjob.protocol.PickleProtocol.
But that gives an error: "Unable to decode input". I also suspect that mrjob only works with the ASCII pickle format, not the binary one, because otherwise how would hadoop streaming deal with bytes that happen to look like a newline? The mrjob source code seems to confirm this.
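The newline collision is easy to demonstrate: any 8-byte integer whose low byte is 0x0A embeds a literal `\n` in the stream, and binary pickles have the same problem, so a line-oriented reader would split records at the wrong places. A quick check:

```python
import pickle
import struct

# The 8-byte big-endian encoding of 10 ends in byte 0x0A, i.e. '\n'
assert b"\n" in struct.pack(">q", 10)

# A binary pickle of the integer 10 also embeds the raw byte 0x0A
assert b"\n" in pickle.dumps(10)

# And binary pickles of bytes objects store their payload verbatim
assert b"\n" in pickle.dumps(b"payload with \n inside")
```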
The other option is to write a custom hadoop streaming jar; mrjob has an option to specify such a jar. But as someone unfamiliar with hadoop/Java, I would prefer a Python-based solution.