finding the smallest number hadoop streaming python

Question

I am new to the Hadoop framework and the MapReduce abstraction.

Basically, I want to find the smallest number in a huge text file (delimited by ",").

So, here is my code, mapper.py:

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into comma-separated numbers
    numbers = line.split(",")
    for number in numbers:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the count of 1 is carried over from the word count example
        print('%s\t%s' % (number, 1))

reducer.py:

#!/usr/bin/env python

from operator import itemgetter
import sys

smallest_number = sys.float_info.max
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    number, count = line.split('\t', 1)
    try:
        number = float(number)
    except ValueError:
        continue

    if number < smallest_number:
        smallest_number = number
        # I think the error is here: there is no key/value pair in this output
        print(smallest_number)

print(smallest_number)

The error I get:

12/10/04 12:07:22 ERROR streaming.StreamJob: Job not successful. Error: NA
12/10/04 12:07:22 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

What kind of results are you getting? What's the problem? What "key value thingy" are you talking about? — Junuxx, Oct 04 '12 at 19:30
@Junuxx: Hi, I just posted the error. Basically, what would a MapReduce abstraction for finding the smallest number in a text file look like? The error I was talking about: the mapper emits (number, 1), the same format as the mapper in the word count example. In the reducer all I care about is the number. I take the number, compare it with the current smallest number, and do the swap? — frazman, Oct 04 '12 at 19:44
It might be helpful to debug without Hadoop: `cat input | ./mapper.py | sort | ./reducer.py` Does this run successfully? — Matt D, Oct 04 '12 at 19:54
@MattD: No, I am getting this: `echo "1,2,44,2" | mapper.py` : No such file or directory. I did `chmod +x mapper.py` and I am in the same directory. I am not sure why it is not able to find the file. — frazman, Oct 04 '12 at 20:05
You may want to post your solution as an answer and accept it as the correct answer for future users. — Matt D, Oct 05 '12 at 13:22

score 0 · Answer 1 · edited Aug 18 '13 at 03:15

First of all, notice that your solution only works with a single reducer. With multiple reducers, each reducer emits the smallest number it receives, so you end up with more than one number.

But then the next question is: if this problem needs a single reducer (i.e., only one reduce task), what do I gain by using MapReduce? The trick is that the mappers still run in parallel. However, you don't want the mappers to output every number they read; otherwise the single reducer has to look through the whole data set, which is no improvement over a sequential solution.

The way to solve this is to have each mapper output only the smallest number it reads. In addition, since all the mappers' outputs must go to the same reducer, the mapper output key must be the same across all mappers.

The mappers will look like this:

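#!/usr/bin/env python

import sys

smallest = None
for line in sys.stdin:
  # remove leading and trailing whitespace
  line = line.strip()
  # split the line into comma-separated fields
  for field in line.split(","):
    try:
      s = float(field)
    except ValueError:
      # skip empty or non-numeric fields (e.g. blank lines)
      continue
    if smallest is None or s < smallest:
      smallest = s

# emit one record per mapper under the constant key 0, so that
# every mapper's minimum is sent to the same reducer
if smallest is not None:
  print('%d\t%r' % (0, smallest))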

The reducer:

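#!/usr/bin/env python

import sys

smallest = None
for line in sys.stdin:
  # remove leading and trailing whitespace
  line = line.strip()
  if not line:
    continue
  # each record is "key<TAB>value"; only the value matters here
  s = float(line.split('\t')[1])
  if smallest is None or s < smallest:
    smallest = s

# print nothing if no records were received
if smallest is not None:
  print(smallest)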
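
To convince yourself that the idea works before submitting a job, here is a minimal sketch that simulates it in plain Python, without Hadoop. The function names `map_min` and `reduce_min` and the sample `splits` are made up for illustration; each inner list stands in for the input split that one mapper would read.

```python
def map_min(lines):
    """Mimic one mapper: emit at most one (key, value) record with the split's minimum."""
    smallest = None
    for line in lines:
        for field in line.strip().split(","):
            try:
                value = float(field)
            except ValueError:
                # skip empty or non-numeric fields
                continue
            if smallest is None or value < smallest:
                smallest = value
    # constant key 0, so every record would be routed to the same reducer
    return [] if smallest is None else [(0, smallest)]


def reduce_min(records):
    """Mimic the single reducer: take the minimum over all mapper outputs."""
    smallest = None
    for _key, value in records:
        if smallest is None or value < smallest:
            smallest = value
    return smallest


# hypothetical input splits, one per mapper
splits = [
    ["1,2,44,2", "7,9"],
    ["-3.5,10", ""],
    ["100"],
]

mapper_output = []
for split in splits:
    mapper_output.extend(map_min(split))

result = reduce_min(mapper_output)
print(mapper_output)
print(result)
```

The reducer only sees one record per mapper (three here) instead of every number in the file, which is where the parallel speedup comes from.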

There are other possible ways to solve this problem, for example using the MapReduce framework itself to sort the numbers so that the first number the reducer receives is the smallest. If you want to understand the MapReduce programming paradigm better, you can read this tutorial with examples, from my blog.
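
One caveat with the sort-based approach, sketched below in plain Python rather than on a cluster: as far as I know, Hadoop Streaming compares keys as text unless you configure a numeric comparator, and text order is not numeric order. The sample `keys` list and the `first_key` helper are hypothetical; the sketch only shows why the first record is the minimum under a numeric sort but not necessarily under a text sort.

```python
# hypothetical mapper output keys, as they would appear on the wire (strings)
keys = ["9", "10", "2.5"]

# text (lexicographic) ordering: "10" sorts before "2.5" because '1' < '2'
text_sorted = sorted(keys)

# numeric ordering: what the sort phase must do for this approach to work
numeric_sorted = sorted(keys, key=float)


def first_key(sorted_keys):
    """Mimic a reducer that reads only the first record and stops."""
    for key in sorted_keys:
        return float(key)
    return None


smallest = first_key(numeric_sorted)
print(text_sorted)
print(numeric_sorted)
print(smallest)
```

With the text ordering, a reducer that trusts the first record would report 10 here, so the sort has to be numeric for this approach to give the right answer.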
