I want to draw a 20% sample from the input dataset.

I thought of 2 approaches:

  1. Emit 20% of the data from each mapper (each mapper emits 20% of its own input). Then, after the shuffle and sort, the reducer takes 20% of the mapper output. (The same procedure is applied in both the map and reduce phases.)

  2. Simply emit every line from the mapper, then take the 20% sample from the full data in the reducer. (Processing is done only in the reducer.)

Which is the better approach?

USB
  • I don't quite understand your first approach, can you rephrase it maybe? – Mike Park Jun 25 '14 at 21:51
  • I edited it. Hope that makes it clear; if not, please ping me. – USB Jun 26 '14 at 03:46
  • In case 1, if you apply the same procedure on both the map and reduce sides, you will be sampling only 4% of the total data (20% of 20%). In the second case you will be sampling 20% of the total data. Please keep that difference in mind. – donut Jun 26 '14 at 16:43

1 Answer

I would definitely go with your first option. I'm not sure why you need a reducer though. Just filter out 20% in the map phase and call it a day.
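
A minimal sketch of what filtering in the map phase could look like. It is written in plain Python rather than against the Hadoop API so that it runs on its own, and `sample_mapper`, `SAMPLE_RATE`, and the `seed` parameter are illustrative names that do not come from the original post. Each record is kept independently with probability 0.2, so the output is approximately, not exactly, 20% of the input.

```python
import random

SAMPLE_RATE = 0.20  # assumed fraction of records each mapper keeps


def sample_mapper(lines, rate=SAMPLE_RATE, seed=None):
    """Yield each input line independently with probability `rate`."""
    rng = random.Random(seed)  # a seed only makes the sketch reproducible
    for line in lines:
        # rng.random() is uniform on [0, 1), so this keeps about `rate` of the lines
        if rng.random() < rate:
            yield line


if __name__ == "__main__":
    # Hypothetical input split handed to a single mapper
    split = [f"record-{i}" for i in range(10_000)]
    kept = list(sample_mapper(split, seed=42))
    print(f"kept {len(kept)} of {len(split)} records")
```

In an actual Hadoop job the same test would presumably live inside the mapper's `map` method, with the job configured as map-only (for example via `job.setNumReduceTasks(0)`), so no shuffle or reduce step runs at all.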

Mike Park
  • But there will be as many output files as there are mappers. Each mapper will output 20% of its data. – USB Jun 26 '14 at 08:27
  • Yes, there will be as many files as there are mappers, but each file is 20% of the data given to that mapper, not 20% of the entire dataset. – Mike Park Jun 26 '14 at 13:32
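
The arithmetic behind the comments in this thread can be checked with a short sketch. The split sizes below are made up for illustration, and `expected_sample_size` is a hypothetical helper, not part of Hadoop. It shows that taking 20% of every mapper's split adds up to 20% of the whole dataset in expectation, while sampling again at 20% in the reducer leaves only 4%.

```python
from fractions import Fraction

RATE = Fraction(1, 5)  # 20%


def expected_sample_size(split_sizes, rate=RATE):
    """Expected number of records kept when every mapper samples its split at `rate`."""
    return sum(size * rate for size in split_sizes)


# Hypothetical input splits, one per mapper
split_sizes = [1_000, 2_500, 6_500]
total = sum(split_sizes)

# Map-only sampling: each output file is 20% of its split,
# and all files together are 20% of the dataset.
map_only = expected_sample_size(split_sizes)

# Approach 1 from the question: the reducer samples the mapper output again.
two_stage = map_only * RATE

print(map_only / total)   # 1/5
print(two_stage / total)  # 1/25
```

So the per-mapper output files can simply be used together as the sample; no reducer is needed to reach the 20% target.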