I have a large number of input files in a proprietary binary format. I need to turn them into rows for further processing. Each file must be decoded in one shot by an external binary (i.e. files must not be concatenated or split).
Options that I'm aware of:
1. Force whole-file loads, extend RecordReader, and use DistributedCache to ship the decoder so it can be run from the RecordReader.
2. Force whole-file loads, have the RecordReader return each file as a single record, and use Hadoop streaming to decode each file.
However, it looks like option 2 won't work: Pig concatenates records before sending them to the STREAM operator (i.e. it sends multiple records per invocation).
Option 1 seems doable, just a little more work; a rough sketch of what I have in mind is below.
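For reference, here's a minimal sketch of option 1, assuming the new `org.apache.hadoop.mapreduce` API. The class names (`WholeFileInputFormat`, `WholeFileRecordReader`) are my own, not stock Hadoop classes; the idea is an InputFormat that refuses to split files, paired with a RecordReader that emits each file as one `BytesWritable` record:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Never splits a file; each input file becomes exactly one record. */
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // the decoder needs the file intact
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }
}

/** Reads the entire file into a single BytesWritable value. */
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private TaskAttemptContext context;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.context = context;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Slurp the whole file; safe because the file was never split.
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
}
```

And the decoding step itself, with the decoder shipped through DistributedCache. I'm assuming a CLI contract of raw file on stdin, one row per line on stdout, and I've put the invocation in the map step rather than the RecordReader, but it would work either way:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Pipes each whole-file record through the cached decoder binary. */
public class DecodeMapper extends Mapper<NullWritable, BytesWritable, Text, NullWritable> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // "./decoder" is the symlink created when the binary was cached, e.g.
        //   DistributedCache.createSymlink(conf);
        //   DistributedCache.addCacheFile(new URI("hdfs:///bin/decoder#decoder"), conf);
        Process p = new ProcessBuilder("./decoder").start();

        // Feed the raw file bytes to the decoder's stdin. (For large outputs
        // you'd pump stdin/stdout on separate threads to avoid a pipe-buffer
        // deadlock; omitted here for brevity.)
        OutputStream toDecoder = p.getOutputStream();
        toDecoder.write(value.getBytes(), 0, value.getLength());
        toDecoder.close();

        // Emit one output record per decoded row on stdout.
        BufferedReader rows = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String row;
        while ((row = rows.readLine()) != null) {
            context.write(new Text(row), NullWritable.get());
        }
        rows.close();

        if (p.waitFor() != 0) {
            throw new IOException("decoder exited with " + p.exitValue());
        }
    }
}
```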
Is there a better way?