
I have a map-reduce job whose input is a big data set (let's say of size 100GB). What this map-reduce job does is split the big data set into chunks and write a separate file for each chunk. That is, the output of the job is multiple files, each of size 64MB.

The output of this map-reduce job is used as the input for another map-reduce job. Since the new input consists of multiple files, each of size 64MB, does each mapper in the second map-reduce job read only one file, or might it read more than one file?

HHH

1 Answer


By default the JobTracker will assign one map task per block. You can use CombineFileInputFormat to get around this behavior and lump multiple blocks into one input split (but that's not what you are asking).
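For completeness, here is a minimal sketch of that combine route, assuming a Hadoop 2.x release that ships CombineTextInputFormat (on older releases you would subclass CombineFileInputFormat yourself); the pass-through mapper, the 256MB cap, and the path arguments are just placeholders to keep the example self-contained:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSplitsJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(CombineSplitsJob.class);

            // One split may now cover several blocks/files instead of exactly one block.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 256MB (tune to your cluster).
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
            CombineTextInputFormat.addInputPath(job, new Path(args[0]));

            // Identity (pass-through) mapper and no reducers, just to keep the sketch runnable.
            job.setMapperClass(Mapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }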

In your situation, if your files go over 64MB and your default block size is 64MB, you can end up with two blocks per ~64MB file, which is probably bad: the second block is tiny and spawns a mapper that does almost no work. If all your files are below the block size, you should get one mapper per block, and therefore one mapper per file.

I wonder why you have the first mapreduce job at all. You are basically recreating something Hadoop does for you for free. If you have a bunch of large files that add up to 100GB, let Hadoop's blocks do that "splitting" for you. For example, a 140MB file with a block size of 64MB will be automatically split into 64MB, 64MB, and 12MB chunks. Three mappers will be spawned to tackle that one file.
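If you want to see that block layout for yourself, a small sketch against the HDFS FileSystem API (the path argument is hypothetical) will print how many blocks a file occupies, which with the default FileInputFormat is roughly how many mappers you will get:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockCount {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // args[0] is whatever file you want to inspect, e.g. the 140MB example above.
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            System.out.println("File size : " + status.getLen() + " bytes");
            System.out.println("Block size: " + status.getBlockSize() + " bytes");
            System.out.println("Blocks    : " + blocks.length
                    + " (roughly the number of map tasks with the default FileInputFormat)");
            for (BlockLocation b : blocks) {
                System.out.println("  offset=" + b.getOffset() + " length=" + b.getLength());
            }
        }
    }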

Donald Miner