I'm working through a question that gives the following information.
Suppose the program presented in 2a) will be executed on a dataset of 200 million
recorded inspections, collecting 2000 days of data. In total there are 1,000,000 unique
establishments. The total input size is 1 Terabyte. The cluster has 100 worker nodes
(all of them idle), and HDFS is configured with a block size of 128MB.
Using that information, provide a reasoned answer to the following questions. State
any assumptions you feel necessary when presenting your answer.
Here, I'm asked to answer these questions.
1) How many worker nodes will be involved during the execution of the Map and Reduce
tasks of the job?
2) How many times does the map method run on each physical worker?
3) How many input splits are processed at each node?
4) How many times will the reduce method be invoked at each reducer?
Here are my attempted answers — can someone verify that they're correct?
Q1) I think this boils down to working out how many mappers are needed. Dividing the input size by the block size: 1 TB / 128 MB = 7812.5 (using decimal units), which rounds up to 7813 map tasks. Since 7813 map tasks are needed and we only have 100 worker nodes, all 100 nodes are going to be used, correct?
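To double-check that arithmetic, here's a quick Python sketch. I'm assuming decimal units (1 TB = 10^12 bytes, 128 MB = 128 × 10^6 bytes); with binary units (2^40 / (128 × 2^20)) the result would be exactly 8192 instead.

```python
import math

# Figures from the question, in decimal units (my assumption).
total_input_bytes = 10**12          # 1 TB
block_size_bytes = 128 * 10**6      # 128 MB HDFS block size

# One map task per input split; the fractional last split still needs a task.
map_tasks = math.ceil(total_input_bytes / block_size_bytes)
print(map_tasks)  # 7813
```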
Q2) From Q1 I figured out that 7813 mappers are needed, so the map method will run 7813 times on each physical worker.
Q3) The number of input splits equals the number of mappers, so there are 7813 splits in total.
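I'm not 100% sure whether Q3 wants the total split count or the splits handled per node; in case it's per node, here's a quick sketch assuming the splits are spread evenly over the 100 workers:

```python
import math

map_tasks = 7813   # total input splits / map tasks from Q1
workers = 100

# With an even spread, each node processes roughly this many splits:
avg_per_node = map_tasks / workers              # 78.13 on average
busiest_node = math.ceil(map_tasks / workers)   # 79 on the busiest nodes
print(avg_per_node, busiest_node)
```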
Q4) I'm told there are 1,000,000 unique establishments, and assuming the job runs with 2 reducers (I believe that's the default?), the reduce method will be invoked 500,000 times on each reducer.
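Sketching the Q4 arithmetic — the 2-reducer count is my assumption, and I'm also assuming the partitioner spreads the 1,000,000 unique keys evenly:

```python
unique_keys = 1_000_000   # unique establishments from the question
reducers = 2              # my assumed reducer count

# reduce() is called once per unique key; with an even partition
# each reducer receives half of the keys.
calls_per_reducer = unique_keys // reducers
print(calls_per_reducer)  # 500000
```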
Can someone go through my reasoning and see if I'm correct? Thanks