Better way of sampling in Hadoop MapReduce

Question

I want to draw a 20% sample from the input dataset.

I thought of 2 approaches:

1. Each mapper emits 20% of its input. After shuffle and sort, the reducer then takes 20% of the mapper output. (The same sampling procedure is applied in both Map and Reduce.)
2. Each mapper simply emits every line, and the reducer takes a 20% sample of the total data. (Sampling is done only in the reducer.)

Which is the better approach?

I don't quite understand your first approach, can you rephrase it maybe? — Mike Park, Jun 25 '14 at 21:51
In case 1, if you apply the same procedure on both the map and reduce sides, you will be sampling only 4% of the total data (20% of 20%). In the second case you will be sampling 20% of the total data. Please take that difference into account. — donut, Jun 26 '14 at 16:43

score 0 · Accepted Answer · answered Jun 26 '14 at 04:03

I would definitely go with your first option. I'm not sure why you need a reducer though. Just filter out 20% in the map phase and call it a day.
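A rough sketch of what "filter out 20% in the map phase" could look like, assuming simple Bernoulli sampling (each record is kept independently with probability 0.2). It is plain Java so that it runs without Hadoop on the classpath. In a real job the same check would sit inside the mapper's `map()` method, with the number of reduce tasks set to zero. The class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class MapSideSampler {
    // Keeps each record independently with probability `rate`.
    // This stands in for the per-record decision a mapper would make.
    static List<String> sample(List<String> split, double rate, Random rng) {
        List<String> kept = new ArrayList<>();
        for (String record : split) {
            if (rng.nextDouble() < rate) {
                kept.add(record); // in a mapper this would be an emit/write call
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical input split of 100,000 lines.
        List<String> split = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            split.add("line-" + i);
        }
        List<String> kept = sample(split, 0.20, new Random());
        System.out.printf("kept %d of %d records%n", kept.size(), split.size());
    }
}
```

Note that this gives approximately 20%, not exactly 20%; the kept count varies a little from run to run. If an exact sample size is required, a different technique (for example reservoir sampling) would be needed.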

Mike Park

But there will be as many output files as there are mappers. Each mapper will produce 20% of its data. – USB Jun 26 '14 at 08:27
Yes, there will be as many files as there are mappers, but each file is 20% of the data given to that mapper, not 20% of the entire dataset. – Mike Park Jun 26 '14 at 13:32
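To make the arithmetic in the comments concrete, here is a small simulation, again plain Java rather than a Hadoop job. The split sizes are invented for illustration and the sampling is assumed to be independent per record. Taken together, the per-mapper output files add up to roughly 20% of the whole dataset, while sampling a second time in the reducer leaves roughly 4% (0.2 × 0.2).

```java
import java.util.Random;

class SamplingRates {
    // Counts how many of n records survive when each is kept with probability `rate`.
    static long keep(long n, double rate, Random rng) {
        long kept = 0;
        for (long i = 0; i < n; i++) {
            if (rng.nextDouble() < rate) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        long[] splitSizes = {300_000, 500_000, 200_000}; // hypothetical mapper inputs
        long total = 0;
        long afterMap = 0;
        for (long size : splitSizes) {
            total += size;
            afterMap += keep(size, 0.20, rng); // each mapper samples only its own split
        }
        long afterReduce = keep(afterMap, 0.20, rng); // sampling again on the reduce side
        System.out.printf("map only:        %.3f of input%n", (double) afterMap / total);
        System.out.printf("map then reduce: %.3f of input%n", (double) afterReduce / total);
    }
}
```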
