
I am trying to write a MapReduce job in Python. The first mapper will split the file into multiple subfiles, and the reducer will do some manipulation on those files and combine them. How do I split the files randomly in Python in the first MapReduce step? I was thinking of using the os module and the split command to do it, but my confusion is this: if I split the file into, say, 30 parts, how do I ensure that all 30 parts will be processed in the same way? Or does Hadoop ensure the concurrency?

For a better understanding of my confusion: suppose I split the file into k parts in the map job. What information do I need to pass to the reduce job to make it operate on each split file?

Bg1850

1 Answer


I am assuming a 128 MB file as input and that you want to do some calculation on it. The Hadoop flow will be like this: for the mapper there will be 2 input splits, so two mappers will run, one per 64 MB block. In the mapper you write your logic, and it outputs key/value pairs. The key/value pairs from both blocks then go through the combiner, shuffle, and sort phases, and finally to 1 reducer (the default), depending upon your use case. At the end you get the desired output. So the file splitting and segregation are handled by the Hadoop framework.
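As a minimal sketch of this flow with Hadoop Streaming (assuming a word-count-style calculation; the file names mapper.py and reducer.py are my own choice, not from the question), the mapper never splits any files itself — it just reads the lines Hadoop hands it and emits key/value pairs:

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper.
# Hadoop decides how the input file is divided into splits and starts
# one mapper per split; each mapper only sees lines on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        # Emit tab-separated key/value pairs: word -> 1
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer.
# After shuffle/sort, input arrives sorted by key, so all values for
# one key are adjacent and can be summed in a single pass.
import sys

current_key = None
total = 0

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key = key
        total = 0
    total += int(value)

# Flush the last key.
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```

You would then run it with the streaming jar (the jar path and HDFS paths below are examples and vary by installation):

```
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/you/input \
    -output /user/you/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```

Note that nothing here passes "which split am I?" information to the reducer: the framework routes every pair with the same key to the same reducer, which is all the coordination the reduce step needs.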

Regards

Jyoti Ranjan Panda
