I have a large number of input files in a proprietary binary format. I need to turn them into rows for further processing. Each file must be decoded in one shot by an external binary (i.e. files must not be concatenated or split).
Options that I'm aware of:
1. Force whole-file loads, extend RecordReader, and use DistributedCache to ship the decoder so it can be run from the RecordReader.
2. Force whole-file loads, have the RecordReader return each file as a single record, and use Hadoop streaming to decode each file.
However, it looks like option 2 won't work: Pig concatenates records before sending them to the STREAM operator (i.e. it sends multiple records per invocation).
Option 1 seems doable, just a little more work; a rough sketch of what I have in mind is below.
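For reference, here's a minimal sketch of option 1, assuming the new `org.apache.hadoop.mapreduce` API. The class names (`WholeFileInputFormat`, `WholeFileRecordReader`) are my own, not stock Hadoop classes; the idea is an InputFormat that refuses to split files, paired with a RecordReader that emits each file as one `BytesWritable` record:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Never splits a file; each input file becomes exactly one record. */
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // the decoder needs the file intact
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }
}

/** Reads the entire file into a single BytesWritable value. */
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private TaskAttemptContext context;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.context = context;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Slurp the whole file; safe because the file was never split.
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
}
```

And the decoding step itself, with the decoder shipped through DistributedCache. I'm assuming a CLI contract of raw file on stdin, one row per line on stdout, and I've put the invocation in the map step rather than the RecordReader, but it would work either way:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Pipes each whole-file record through the cached decoder binary. */
public class DecodeMapper extends Mapper<NullWritable, BytesWritable, Text, NullWritable> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // "./decoder" is the symlink created when the binary was cached, e.g.
        //   DistributedCache.createSymlink(conf);
        //   DistributedCache.addCacheFile(new URI("hdfs:///bin/decoder#decoder"), conf);
        Process p = new ProcessBuilder("./decoder").start();

        // Feed the raw file bytes to the decoder's stdin. (For large outputs
        // you'd pump stdin/stdout on separate threads to avoid a pipe-buffer
        // deadlock; omitted here for brevity.)
        OutputStream toDecoder = p.getOutputStream();
        toDecoder.write(value.getBytes(), 0, value.getLength());
        toDecoder.close();

        // Emit one output record per decoded row on stdout.
        BufferedReader rows = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String row;
        while ((row = rows.readLine()) != null) {
            context.write(new Text(row), NullWritable.get());
        }
        rows.close();

        if (p.waitFor() != 0) {
            throw new IOException("decoder exited with " + p.exitValue());
        }
    }
}
```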
Is there a better way?