So this has always confused me. I'm not sure exactly how map-reduce works and I seem to get lost in the exact chain of events.
My understanding:
- The master chunks up the input files and hands them to mappers as (K1, V1) pairs.
- Each mapper takes its chunk, performs Map(K1,V1) -> (K2,V2), and writes the output to its own individual files.
- THIS IS WHERE I'M LOST.
- Do these individual files get combined somehow? What if the same key appears in more than one file?
- Who does the combining? Is it the master? If all the files go to the master at this step, won't there be a massive bottleneck? Does everything get combined into one file? Are the files then re-chunked and handed to the reducers?
- Or, if all the files go directly to the reducers instead, what happens with the repeated K3s in the (K3, V3) files at the end of the process? How are they combined? Is there another map-reduce phase? And if so, do we need to define new operations: Map(K3,V3)->(K4,V4), Reduce(K4,V4)->(K3,V3)?
To sum up, I just don't get how the files are re-combined properly, and it's causing my map-reduce logic to fail.
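
To make the question concrete, here is a minimal single-process sketch of the flow as I currently picture it. It rests on an assumption I have not confirmed for any particular framework: each mapper splits its own output into one bucket per reducer using hash(K2) mod the number of reducers, and each reducer pulls its bucket from every mapper. The word-count job and the names `map_fn`, `partition`, `reduce_fn`, `run_job`, and `NUM_REDUCERS` are all made up for illustration.

```python
from collections import defaultdict
import zlib

NUM_REDUCERS = 2  # hypothetical value, chosen only for this example


def map_fn(filename, text):
    """Map(K1, V1) -> list of (K2, V2): emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]


def partition(key, num_reducers):
    """Pick the reducer for a key. crc32 is used because it gives the
    same answer on every mapper, unlike Python's salted str hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Reduce(K2, [V2, ...]) -> (K3, V3): sum the counts for one key."""
    return key, sum(values)


def run_job(chunks, num_reducers=NUM_REDUCERS):
    # Map phase: every mapper writes one bucket ("file") per reducer.
    mapper_outputs = []
    for filename, text in chunks:
        buckets = [[] for _ in range(num_reducers)]
        for key, value in map_fn(filename, text):
            buckets[partition(key, num_reducers)].append((key, value))
        mapper_outputs.append(buckets)

    # Shuffle + reduce: reducer r reads bucket r from every mapper,
    # groups the values by key, then calls reduce_fn once per key.
    reducer_outputs = []
    for r in range(num_reducers):
        grouped = defaultdict(list)
        for buckets in mapper_outputs:
            for key, value in buckets[r]:
                grouped[key].append(value)
        reducer_outputs.append(
            dict(reduce_fn(k, vs) for k, vs in sorted(grouped.items()))
        )
    return reducer_outputs


# Two input chunks that share keys ("the", "cat", "sat").
chunks = [("a.txt", "the cat sat"), ("b.txt", "the dog sat on the cat")]
outputs = run_job(chunks)
for r, out in enumerate(outputs):
    print(f"reducer {r}: {out}")
```

If that assumption holds, every copy of a given key ends up at the same reducer, so a key should never show up in two reducers' output files, and the master only hands out work without touching the data itself.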