
My summary from the book "Hadoop: The Definitive Guide" by Tom White is:

All the logic between the user's map function and the user's reduce function is called the shuffle, so the shuffle spans both the map side and the reduce side. After the user's map() function runs, its output goes into an in-memory circular buffer. When the buffer is 80% full, a background thread starts spilling the buffer's contents to a spill file on disk. Each spill file is partitioned by key (one partition per reducer), and within each partition the key-value pairs are sorted by key. After sorting, if a combiner function is configured, the combiner is called. All spill files are eventually merged into one MapOutputFile, and every map task's MapOutputFile is fetched over the network by the reduce tasks. Each reduce task does another merge/sort, and then the user's reduce function is called.
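To make that flow concrete for word count, here is a small plain-Java toy simulation I put together of the map-side steps (partition by key hash, sort within each partition, then combine). It is not Hadoop code, just an illustration; the input string and the two-reducer setup are assumptions:

```java
import java.util.*;

/**
 * Toy simulation of the map-side shuffle for word count.
 * Assumes 2 reducers; not Hadoop code, just an illustration of
 * partition -> sort -> combine on one mapper's output.
 */
public class MapSideShuffleSketch {
    public static void main(String[] args) {
        String input = "the quick fox the lazy fox";
        int numReducers = 2;

        // map(): emit (word, 1) for every word, in input order (still unsorted)
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String word : input.split("\\s+")) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
        }

        // Partition each pair, mimicking Hadoop's default HashPartitioner:
        // (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> kv : mapOutput) {
            int p = (kv.getKey().hashCode() & Integer.MAX_VALUE) % numReducers;
            partitions.get(p).add(kv);
        }

        // Within each partition, sort by key and then "combine" (sum counts locally),
        // like a spill file that has been sorted and run through the combiner.
        for (int p = 0; p < numReducers; p++) {
            partitions.get(p).sort(Map.Entry.comparingByKey());
            SortedMap<String, Integer> combined = new TreeMap<>();
            for (Map.Entry<String, Integer> kv : partitions.get(p)) {
                combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
            System.out.println("partition " + p + " -> reducer " + p + ": " + combined);
        }
        // Each reducer would then fetch its partition from every mapper, merge-sort
        // those already-sorted pieces, and call reduce(word, [counts...]) per key.
    }
}
```

With that input, each partition prints its sorted and combined (word, count) pairs, which is roughly what one merged map output looks like before the reducers fetch it.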

So the questions are:

1.) According to the above summary, this is the flow:

Mapper--Partitioner--Sort--Combiner--Shuffle--Sort--Reducer--Output

1a.) Is this the flow or is it something else?

1b.) Can you explain the above flow with an example, say the word count example? (The ones I found online weren't that elaborate.)

2.) So the map phase's output is one big file per map task (the MapOutputFile)? And is it this one big file that gets broken up so that the key-value pairs are passed on to their respective reducers?

3.) Why does the sorting happen a second time, when the data is already sorted and combined before being passed on to the respective reducers?

4.) Say mapper1 runs on DataNode1; is it then necessary for reducer1 to run on DataNode1, or can it run on any DataNode?

bhoots21304

1 Answer


Answering this question fully would be like rewriting the whole history. A lot of your doubts have to do with operating-system concepts rather than MapReduce itself.

  1. Mapper output data is written to the local file system. The data is partitioned based on the number of reducers, and within each partition there can be multiple files, depending on how many times spills have happened.
  2. Each small file in a given partition is sorted, because an in-memory sort is done before the file is written.
  3. Why does the data need to be sorted on the mapper side? a. The data is sorted and merged on the mapper side to decrease the number of files. b. The files are sorted because otherwise it would be impossible for the reducer to gather all the values for a given key.
  4. After gathering data on the reducer, the number of files on the system first needs to be decreased (remember that ulimit imposes a fixed limit on open file descriptors for every user, in this case the hdfs user).
  5. The reducer then just maintains a file pointer into a small set of sorted files and merges them (see the merge sketch after this list).
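
For point 5, the merge the reducer does over a small set of already-sorted files can be sketched with a priority queue holding one pointer per file. This is plain Java, not the actual Hadoop merger, and the sample data is made up:

```java
import java.util.*;

/** Toy k-way merge of already-sorted runs, as a reducer does with its fetched map outputs. */
public class SortedRunMergeSketch {
    public static void main(String[] args) {
        // Each inner list stands for one sorted map output (or spill) for this reducer's partition.
        List<List<String>> sortedRuns = Arrays.asList(
                Arrays.asList("apple", "fox", "zebra"),
                Arrays.asList("box", "fox", "pen"),
                Arrays.asList("apple", "cat"));

        // One "file pointer" (iterator) per run; the heap always exposes the smallest current key.
        PriorityQueue<PeekingRun> heap =
                new PriorityQueue<>(Comparator.comparing((PeekingRun r) -> r.head));
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) heap.add(new PeekingRun(it.next(), it));
        }

        // Popping in order yields a single globally sorted stream.
        while (!heap.isEmpty()) {
            PeekingRun r = heap.poll();
            System.out.println(r.head);
            if (r.rest.hasNext()) heap.add(new PeekingRun(r.rest.next(), r.rest));
        }
    }

    /** A sorted run with its current head element and an iterator over the rest. */
    static class PeekingRun {
        final String head;
        final Iterator<String> rest;
        PeekingRun(String head, Iterator<String> rest) { this.head = head; this.rest = rest; }
    }
}
```

Equal keys come out of the heap adjacently, which is exactly what lets reduce() be handed all the values for one key in a single call.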

For more interesting ideas, please refer to: http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
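
Regarding 1b, the canonical WordCount job (using the org.apache.hadoop.mapreduce API) looks roughly like this; all the partitioning, spilling, sorting and merging described above happens inside the framework, between map() and reduce():

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(): for each input line, emit a (word, 1) pair per token
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce(): receives (word, [1, 1, ...]) with keys already sorted and grouped,
  // and sums the counts; the same class is reused as the combiner on the map side
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner runs on sorted spill data
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the combiner here is simply the reducer class reused on the map side, which is safe because summing counts is associative and commutative.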

KrazyGautam