I am writing a Hadoop sequence file using a text file as input. I know how to write a sequence file from a text file.

But I want to limit the output sequence file to a specific size, say 256MB.

Is there any built-in method to do this?

Pradeep Bhadani

1 Answer

AFAIK you'll need to write your own custom output format to limit output file sizes. By default, FileOutputFormat implementations create a single output file per reducer.

Another option is to create your sequence files as normal, then run a second, map-only job with identity mappers, and set the minimum and maximum input split sizes so that each mapper processes ~256MB. This means an input file of 1GB would be processed by 4 mappers and produce output files of ~256MB each. You will get smaller files where an input file is, say, 300MB (a 256MB mapper and a 44MB mapper will run).
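To make the arithmetic concrete, here is a rough sketch in plain Java of how file-based input formats typically carve a file into splits. It is a simplified model, not Hadoop's actual code: the class name `SplitSizeSketch`, the 128MB block size, and the 10% slack factor are assumptions for illustration, and real sequence file splits are also aligned to sync markers, so actual sizes are only approximate.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of file splitting; illustrative only, not Hadoop source code.
class SplitSizeSketch {
    // Assumed slack factor: the final split may exceed the target by up to 10%.
    static final double SPLIT_SLOP = 1.1;

    // Returns the size in bytes of each split for a file of length fileLen.
    static List<Long> splitSizes(long fileLen, long minSize, long maxSize, long blockSize) {
        // The effective split size is the block size clamped to [minSize, maxSize].
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        List<Long> sizes = new ArrayList<>();
        long remaining = fileLen;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            sizes.add(splitSize);
            remaining -= splitSize;
        }
        // Whatever is left becomes the last (possibly smaller) split.
        if (remaining != 0) {
            sizes.add(remaining);
        }
        return sizes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long target = 256 * mb; // 268435456 bytes
        long block = 128 * mb;  // assumed HDFS block size
        System.out.println(splitSizes(1024 * mb, target, target, block)); // 1GB input
        System.out.println(splitSizes(300 * mb, target, target, block));  // 300MB input
    }
}
```

Under this model, an input only slightly larger than the target (within about 10%) stays as a single split, so some output files may come out a little over 256MB.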

The properties you are looking for are:

  • mapred.min.split.size
  • mapred.max.split.size

Both are specified in bytes, so set them both to 268435456 (256MB).
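As a minimal sketch of the second job, assuming Hadoop 2 or later with the new `org.apache.hadoop.mapreduce` API, a driver might look like the following. The class name `ResizeSequenceFiles`, the job name, and the `Text` key/value types are placeholders (use whatever types your sequence files actually contain). The `FileInputFormat.setMinInputSplitSize` and `setMaxInputSplitSize` helpers set the split size properties for you; on newer releases those properties go by the names `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical driver: rewrites existing sequence files into ~256MB outputs.
public class ResizeSequenceFiles {
    public static void main(String[] args) throws Exception {
        long targetBytes = 268435456L; // 256MB

        Job job = Job.getInstance(new Configuration(), "resize-sequence-files");
        job.setJarByClass(ResizeSequenceFiles.class);

        // The base Mapper class passes every record through unchanged (identity).
        job.setMapperClass(Mapper.class);
        // Zero reducers makes this a map-only job: one output file per mapper.
        job.setNumReduceTasks(0);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // Assumption: keys and values are Text; change to match your data.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Pin both bounds so each mapper reads roughly targetBytes of input.
        FileInputFormat.setMinInputSplitSize(job, targetBytes);
        FileInputFormat.setMaxInputSplitSize(job, targetBytes);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // existing sequence files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // resized output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the split size bounds the input each mapper reads, not the output it writes. If the output uses different compression settings from the input, the resulting files can be larger or smaller than 256MB.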

Chris White