Why do we need setup() method in MapReduce when we can initialize parameters in map() or reduce()?

Question

I am new to Hadoop and to the MapReduce paradigm in general. I have searched the web extensively about overriding the setup() method in the Map class to access the configuration object. But from what I read, it seems that the setup() method is anyways called every time a task is run.

So why do we need a separate method to access the configuration object and initialize parameters? Why can't we do the same directly in the map() or reduce() methods?

Although both approaches give the required output in the end, is there a performance consideration when choosing between them? Thanks in advance.

score 0 · Answer 1 · answered Oct 28 '16 at 17:54

In my opinion, the answer lies not in Hadoop but in the programming paradigm. It is always good to separate the different parts of the business logic, and setting up the running environment is different from running the map itself.

Imagine a scenario where you have some data on which you want to run multiple calculations. If your jobs share a parent class, it is better to do the common setup phases there by overriding a separate method.

The design simply encourages behaviour you would choose anyway.
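To illustrate the parent-class idea, here is a minimal sketch in plain Java. It does not use the Hadoop API: `BaseMapper`, `SumMapper`, `MaxMapper` and the `demo.delimiter` key are hypothetical names invented for this example, and a `Map<String, String>` stands in for the job configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class SharedSetupDemo {

    /** Hypothetical parent class: holds the setup step shared by several jobs. */
    static abstract class BaseMapper {
        protected String delimiter;

        // Common setup, written once and inherited by every job.
        void setup(Map<String, String> conf) {
            delimiter = conf.getOrDefault("demo.delimiter", ",");
        }

        abstract int map(String record);

        // Helper shared by subclasses: split a record into integer fields.
        protected int[] fields(String record) {
            String[] parts = record.split(Pattern.quote(delimiter));
            int[] values = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                values[i] = Integer.parseInt(parts[i].trim());
            }
            return values;
        }
    }

    /** First calculation on the data: sum of the fields. */
    static class SumMapper extends BaseMapper {
        @Override
        int map(String record) {
            int sum = 0;
            for (int v : fields(record)) sum += v;
            return sum;
        }
    }

    /** Second calculation on the same data: largest field. */
    static class MaxMapper extends BaseMapper {
        @Override
        int map(String record) {
            int max = Integer.MIN_VALUE;
            for (int v : fields(record)) max = Math.max(max, v);
            return max;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("demo.delimiter", ";");

        BaseMapper sum = new SumMapper();
        BaseMapper max = new MaxMapper();
        sum.setup(conf);
        max.setup(conf);

        System.out.println(sum.map("3;9;4")); // prints 16
        System.out.println(max.map("3;9;4")); // prints 9
    }
}
```

Neither subclass repeats the configuration handling; each only contains its own calculation.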

score 0 · Answer 2 · answered Oct 28 '16 at 17:55

Without setup(), you would have to check in map() or reduce() whether you have already initialized the parameters. setup() simplifies initialization by separating the initialization phase from the actual map logic.
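As a rough illustration of the check described above, this is what initializing inside map() could look like. It is a plain-Java sketch, not Hadoop code: `GuardedMapper` and its fields are hypothetical, and the hard-coded separator stands in for a value read from the job configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LazyInitDemo {

    /** Hypothetical mapper that initializes its parameter inside map(). */
    static class GuardedMapper {
        private boolean initialized = false;
        private String separator;
        int guardChecks = 0; // counts how often the guard is evaluated
        int initRuns = 0;    // counts how often initialization actually runs

        void map(String record, List<String> out) {
            guardChecks++;   // the check is paid on every single record
            if (!initialized) {
                separator = "|"; // stand-in for reading a configuration value
                initialized = true;
                initRuns++;
            }
            out.add(record.replace(" ", separator));
        }
    }

    public static void main(String[] args) {
        GuardedMapper mapper = new GuardedMapper();
        List<String> out = new ArrayList<>();
        for (String record : Arrays.asList("a b", "c d", "e f")) {
            mapper.map(record, out);
        }
        System.out.println(out); // prints [a|b, c|d, e|f]
        System.out.println(mapper.guardChecks + " checks, " + mapper.initRuns + " init");
        // prints 3 checks, 1 init
    }
}
```

The guard works, but it mixes initialization into the per-record logic. Moving the body of the `if` into a separate setup step removes both the flag and the check.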

Filipp Voronov


score 0 · Answer 3 · answered Oct 28 '16 at 18:08

I'm not sure if I'm right, but as far as I understand, map() and reduce() are executed on nodes in a distributed network, where the nodes have no knowledge of the whole system. So what you can access inside the map() and reduce() methods is not what is configured on the main node. You can't simply access the whole configuration from a node, because that would mean staying connected to the main node the whole time.

Kacper


The job configuration is global and is available to the nodes working on a particular phase. When a particular mapper runs on a node, the Mapper class's setup is called first, and then map is called with the proper input split. – pifta Oct 28 '16 at 21:40

score 0 · Accepted Answer · answered Oct 29 '16 at 23:35

Re: "it seems that the setup() method is anyways called every time a task is run."

Whenever a task is run, a number of records are processed by the corresponding Map or Reduce task. The map() method is called for every record being processed (reduce() is called once for every key and its group of values). The setup() method, however, runs once per task, giving you the opportunity to optimize the workflow by initializing configuration and resources (such as a database connection or a reference file) only once for all the records processed by that task.

Similarly, the API provides a callback named cleanup() where you can release those resources. It is invoked once the task has finished processing the records allocated to it.
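The call order can be sketched in plain Java as follows. This is a simulation, not the Hadoop API: `SimpleMapper`, `PrefixMapper` and the `demo.prefix` key are hypothetical, and a `Map<String, String>` stands in for the job configuration. The `run` method mirrors the order Hadoop's `Mapper.run()` follows: setup once, map per record, cleanup once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TaskLifecycleDemo {

    /** Hypothetical stand-in for a mapper task; not the real Hadoop class. */
    static abstract class SimpleMapper {
        protected void setup(Map<String, String> conf) {}
        protected abstract void map(String record, List<String> out);
        protected void cleanup() {}

        // One task: setup once, map for each record, cleanup once at the end.
        final void run(Map<String, String> conf, List<String> records, List<String> out) {
            setup(conf);
            try {
                for (String record : records) {
                    map(record, out);
                }
            } finally {
                cleanup();
            }
        }
    }

    /** Example mapper that reads one parameter in setup and reuses it in map. */
    static class PrefixMapper extends SimpleMapper {
        int setupCalls = 0;
        int mapCalls = 0;
        int cleanupCalls = 0;
        private String prefix;

        @Override
        protected void setup(Map<String, String> conf) {
            setupCalls++;
            prefix = conf.getOrDefault("demo.prefix", ""); // read once per task
        }

        @Override
        protected void map(String record, List<String> out) {
            mapCalls++;
            out.add(prefix + record); // reuses the value, no re-initialization
        }

        @Override
        protected void cleanup() {
            cleanupCalls++;
            prefix = null; // stand-in for closing a connection or file
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("demo.prefix", "row:");

        PrefixMapper mapper = new PrefixMapper();
        List<String> out = new ArrayList<>();
        mapper.run(conf, Arrays.asList("a", "b", "c"), out);

        System.out.println(out); // prints [row:a, row:b, row:c]
        System.out.println("setup=" + mapper.setupCalls
                + " map=" + mapper.mapCalls
                + " cleanup=" + mapper.cleanupCalls); // prints setup=1 map=3 cleanup=1
    }
}
```

With an expensive resource in place of the prefix, the difference between one initialization per task and one per record is where the performance gain comes from.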
