
I am new to Hadoop and I am learning through a few examples. I am currently trying to pass in a file of random integers. I want each number doubled a number of times specified by the user at runtime.

3536 5806 2545 249 485 5467 1162 8941 962 6457 665 6754 889 5159 3161 5401 704 4897 135 907 8111 1059 4971 5195 3031 630 6265 827 5882 9358 9212 9540 676 3191 4995 8401 9857 4884 8002 3701 931 875 6427 6945 5483 545 4322 5120 1694 2540 9039 5524 872 840 8730 4756 2855 718 6612 4125

Above is the file sample.

For example, when the user specifies at runtime

 jar ~/dissertation/workspace/TestHadoop/src/DoubleNum.jar DoubleNum Integer Output 3

the output for, say, the first line will be 3536*8 5806*8 2545*8 249*8 485*8 5467*8 1162*8 8941*8 962*8 6457*8

because on each iteration every number is doubled, so after 3 iterations each value has been multiplied by 2^3 = 8. How can I achieve this using MapReduce?

asembereng
  • are you sure mapreduce is the right thing for this task? – Thomas Jungblut Jul 11 '11 at 17:45
  • @Thomas Jungblut I just want to implement it in MapReduce. The whole point is that I want to see how I can iterate a sub-skeleton like a map a number of times: iter(Map, 4) would run the mapper 4 times in parallel, but the output of the first map would be passed as input to the second. – asembereng Jul 11 '11 at 19:18

1 Answer


For chaining one job into the next, check out: Chaining multiple MapReduce jobs in Hadoop
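As a concrete illustration of that pattern, a driver can run one map-only job per iteration, feeding each pass's output directory in as the next pass's input. This is only a sketch against the newer `org.apache.hadoop.mapreduce` API; the class name, path layout, and argument order are assumptions, not taken from the question:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {

    // One pass: doubles every number on the input line.
    public static class DoublePassMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder out = new StringBuilder();
            for (String token : value.toString().trim().split("\\s+")) {
                if (out.length() > 0) out.append(' ');
                out.append(Long.parseLong(token) * 2L);
            }
            ctx.write(NullWritable.get(), new Text(out.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        int iterations = Integer.parseInt(args[2]); // e.g. 3
        for (int i = 0; i < iterations; i++) {
            // Output directory of pass i becomes the input of pass i+1.
            Path in  = (i == 0) ? new Path(args[0])
                                : new Path(args[1] + "/pass" + (i - 1));
            Path out = new Path(args[1] + "/pass" + i);

            Job job = Job.getInstance(new Configuration(), "double-pass-" + i);
            job.setJarByClass(ChainDriver.class);
            job.setMapperClass(DoublePassMapper.class);
            job.setNumReduceTasks(0); // map-only pass
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, in);
            FileOutputFormat.setOutputPath(job, out);

            if (!job.waitForCompletion(true)) {
                System.exit(1); // abort the chain if any pass fails
            }
        }
    }
}
```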

Also, this may be a good time to learn about sequence files, as they provide an efficient way of passing data from one map/reduce job to another.

As for your particular problem, you don't need reducers here, so make the job map-only by setting the number of reducers to zero. Sending the output through reducers would only incur extra network overhead. (However, be careful about the number of files you create over time; eventually the NameNode will not appreciate it. Each mapper creates one output file.)
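In the driver, that boils down to a one-line setting (this fragment assumes the newer `org.apache.hadoop.mapreduce` Job API):

```java
// Zero reducers: mapper output is written directly to HDFS,
// skipping the shuffle/sort phase entirely.
job.setNumReduceTasks(0);
// With no reducers, the mapper's output types are the job's final output types.
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
```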

I understand that you are trying to use this as an example of perhaps something more complex... but in this case you can use a common optimization technique: if you find yourself wanting to chain one map-only job into another map/reduce job, you can squash the two mappers together. For example, instead of multiplying by 2, then by 2 again, then by 2 again, why not just multiply by 2 three times (i.e., by 2^3 = 8) in the same mapper? Basically, if all your operations are independent per number or line, you can apply all the iterations within the same mapper, per record. This will reduce the amount of overhead significantly.
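To make the squashed version concrete, here is a small plain-Java sketch of the per-record logic (the class and method names here are made up for illustration; in the real job this logic would live inside the mapper's map() method):

```java
// Hypothetical helper illustrating the "squashed" mapper logic: instead of
// chaining n map-only jobs that each double the numbers, multiply each
// number by 2^n in a single pass over the record.
public class DoubleNumLogic {

    // Multiplies every whitespace-separated integer in `line` by 2^iterations.
    public static String multiplyLine(String line, int iterations) {
        long factor = 1L << iterations; // 2^iterations
        StringBuilder out = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(Long.parseLong(token) * factor);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First three numbers of the sample file, 3 iterations (factor 8).
        System.out.println(multiplyLine("3536 5806 2545", 3)); // 28288 46448 20360
    }
}
```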

Donald Miner
  • Chaining will wait for the first map to run and then run another map on the output of the previous map? It would be easier to explain what I exactly want to do with a sketch of the Mapper class. – asembereng Jul 12 '11 at 01:49
  • Basically I just want the map function to iterate a number of times. The above example is only to simplify the explanation; the operations within the map could be anything. – asembereng Jul 12 '11 at 02:34