
I have a large number of input files in a proprietary binary format. I need to turn them into rows for further processing. Each file must be decoded in one shot by an external binary (i.e. files must not be concatenated or split).

Options that I'm aware of:

  1. Force single-file loading, extend RecordReader, and use the DistributedCache to run the decoder from within the RecordReader
  2. Force single-file loading, have the RecordReader return each file as a single record, and use Hadoop Streaming to decode each file

It looks, however, like [2] will not work, since Pig concatenates records before sending them to the STREAM operator (i.e. it sends multiple records at once).

[1] seems doable, just a little more work.

Is there a better way?

corsair

1 Answer


Option 1 that you mentioned seems the most viable. In addition to extending RecordReader, you should extend the appropriate InputFormat and override isSplitable() to return false, so that each file is handed to a single mapper in one piece.
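
As a rough sketch of what that could look like with the newer MapReduce API (org.apache.hadoop.mapreduce), the class names below (WholeFileInputFormat, WholeFileRecordReader) are illustrative, not something from your setup; the reader simply emits the whole file's bytes as one record, which you could then hand to your external decoder:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // isSplitable() returning false guarantees each input file becomes exactly one split.
    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileRecordReader();
        }

        // Emits exactly one record per file: the file's entire contents as bytes.
        public static class WholeFileRecordReader
                extends RecordReader<NullWritable, BytesWritable> {

            private FileSplit fileSplit;
            private TaskAttemptContext context;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.context = context;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(context.getConfiguration());
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            @Override
            public NullWritable getCurrentKey() { return NullWritable.get(); }

            @Override
            public BytesWritable getCurrentValue() { return value; }

            @Override
            public float getProgress() { return processed ? 1.0f : 0.0f; }

            @Override
            public void close() { /* stream already closed in nextKeyValue() */ }
        }
    }

Since you are in Pig, this InputFormat would still need to be wrapped in a custom LoadFunc, with the external decoder shipped via the DistributedCache as you describe in option 1.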

Niranjan Sarvi