
I would like to use Hadoop Map/Reduce to process delimited Protocol Buffer files that are compressed with something other than LZO, e.g. xz or gzip. Twitter's elephant-bird library appears to support reading only LZO-compressed protobuf files, so it doesn't meet my needs. Is there an existing library or a standard approach to doing this?

(NOTE: As you can see by my choice of compression algorithms, it's not necessary for the solution to make the protobuf files splittable. Your answer doesn't even need to specify a particular compression algorithm, but should allow for at least one of the ones I mentioned.)
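For concreteness, here's a rough sketch of the kind of record reader I have in mind (the class name ProtobufRecordReader and the generated message class MyRecord are placeholders). It leans on Hadoop's CompressionCodecFactory to pick a decompressor from the file extension, so gzip works out of the box and xz would work if an xz codec is on the classpath, then reads records with protobuf's parseDelimitedFrom:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Placeholder names: MyRecord stands in for a protoc-generated message class.
    public class ProtobufRecordReader extends RecordReader<LongWritable, MyRecord> {
      private InputStream in;
      private final LongWritable key = new LongWritable(-1);
      private MyRecord value;

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path path = ((FileSplit) split).getPath();
        FileSystem fs = path.getFileSystem(conf);
        in = fs.open(path);
        // Choose a codec from the file extension; .gz ships with Hadoop,
        // .xz requires an extra codec on the classpath.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        if (codec != null) {
          in = codec.createInputStream(in);
        }
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        // parseDelimitedFrom returns null at end of stream.
        value = MyRecord.parseDelimitedFrom(in);
        if (value == null) {
          return false;
        }
        key.set(key.get() + 1); // record index within the file
        return true;
      }

      @Override public LongWritable getCurrentKey() { return key; }
      @Override public MyRecord getCurrentValue() { return value; }
      @Override public float getProgress() { return 0; } // unknown for a compressed stream
      @Override public void close() throws IOException { if (in != null) in.close(); }
    }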

Josh Hansen

1 Answer


You may want to look into the RAgzip patch for Hadoop, which enables multiple map tasks to process a single large gzipped file: RAgzip
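If you don't need splits (as the question allows), an alternative to patching Hadoop is simply to mark the files non-splittable and let the codec factory handle decompression. A minimal sketch of an input format that pairs with the record reader in the question (DelimitedProtobufInputFormat and MyRecord are hypothetical names):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class DelimitedProtobufInputFormat extends FileInputFormat<LongWritable, MyRecord> {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        // Whole-file decompression: one map task per file.
        return false;
      }

      @Override
      public RecordReader<LongWritable, MyRecord> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new ProtobufRecordReader();
      }
    }

You'd then wire it into the job with job.setInputFormatClass(DelimitedProtobufInputFormat.class). The trade-off versus RAgzip is one map task per file instead of parallelism within a file.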

fjxx