I'm working through a question that gives the following information.
Suppose the program presented in 2a) will be executed on a dataset of 200 million
recorded inspections, collecting 2000 days of data. In total there are 1,000,000 unique
establishments. The total input size is 1 Terabyte. The cluster has 100 worker nodes
(all of them idle), and HDFS is configured with a block size of 128MB.
Using that information, provide a reasoned answer to the following questions. State
any assumptions you feel necessary when presenting your answer.
Here, I'm asked to answer these questions.
1) How many worker nodes will be involved during the execution of the Map and Reduce
tasks of the job?
2) How many times does the map method run on each physical worker?
3) How many input splits are processed at each node?
4) How many times will the reduce method be invoked at each reducer?
Here are my attempted answers — can someone verify that they're correct?
Q1) I think this boils down to working out how many mappers are needed. Dividing the input size by the block size: 1 TB / 128 MB = 7812.5 (using decimal units), which rounds up to 7813 map tasks. Since 7813 map tasks are needed and we only have 100 worker nodes, all 100 nodes are going to be used, correct?
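To double-check that arithmetic, here's a quick Python sketch. I'm assuming decimal units (1 TB = 10^12 bytes, 128 MB = 128 × 10^6 bytes); with binary units (2^40 / (128 × 2^20)) the result would be exactly 8192 instead.

```python
import math

# Figures from the question, in decimal units (my assumption).
total_input_bytes = 10**12          # 1 TB
block_size_bytes = 128 * 10**6      # 128 MB HDFS block size

# One map task per input split; the fractional last split still needs a task.
map_tasks = math.ceil(total_input_bytes / block_size_bytes)
print(map_tasks)  # 7813
```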
Q2) From Q1 I figured out that 7813 mappers are needed, so the map method will run 7813 times on each physical worker.
Q3) The number of input splits equals the number of mappers, so there are 7813 splits in total.
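I'm not 100% sure whether Q3 wants the total split count or the splits handled per node; in case it's per node, here's a quick sketch assuming the splits are spread evenly over the 100 workers:

```python
import math

map_tasks = 7813   # total input splits / map tasks from Q1
workers = 100

# With an even spread, each node processes roughly this many splits:
avg_per_node = map_tasks / workers              # 78.13 on average
busiest_node = math.ceil(map_tasks / workers)   # 79 on the busiest nodes
print(avg_per_node, busiest_node)
```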
Q4) I'm told there are 1,000,000 unique establishments, and assuming the job runs with 2 reducers (I believe that's the default?), the reduce method will be invoked 500,000 times on each reducer.
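Sketching the Q4 arithmetic — the 2-reducer count is my assumption, and I'm also assuming the partitioner spreads the 1,000,000 unique keys evenly:

```python
unique_keys = 1_000_000   # unique establishments from the question
reducers = 2              # my assumed reducer count

# reduce() is called once per unique key; with an even partition
# each reducer receives half of the keys.
calls_per_reducer = unique_keys // reducers
print(calls_per_reducer)  # 500000
```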
Can someone go through my reasoning and see if I'm correct? Thanks