Possibly a dumb question. I have data like the following:

id1, value
1, 20.2
1, 20.4
...

I want to find the mean and median for each id (not the global mean and median). I am using Python Hadoop Streaming.

mapper.py
import os
import sys

for line in sys.stdin:
    try:
        # remove the trailing newline, then split on the comma
        line = line.rstrip(os.linesep)
        tokens = line.split(",")
        print '%s,%s' % (tokens[0], tokens[1])
    except Exception:
        continue


reducer.py
import os
import sys
from collections import defaultdict

data_dict = defaultdict(list)

def mean(data_list):
    # arithmetic mean; 0 for an empty list
    return sum(data_list) / float(len(data_list)) if len(data_list) else 0

def median(mylist):
    # middle element, or the average of the two middle elements
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
    return sorts[length / 2]

# accumulate every value seen for each id
for line in sys.stdin:
    try:
        line = line.rstrip(os.linesep)
        serial_id, duration = line.split(",")
        data_dict[serial_id].append(float(duration))
    except Exception:
        pass

for k, v in data_dict.items():
    print "%s,%s,%s" % (k, mean(v), median(v))

I am expecting a single mean/median for each key, but I see the same id duplicated with different mean and median values across the output files. For example, grepping the output for one id:

mean_median/part-00003:SH002616940000,5.0,5.0   
mean_median/part-00008:SH002616940000,901.0,901.0   
mean_median/part-00018:SH002616940000,11.0,11.0 
mean_median/part-00000:SH002616940000,2.0,2.0   
mean_median/part-00025:SH002616940000,1800.0,1800.0 
mean_median/part-00002:SH002616940000,4.0,4.0   
mean_median/part-00006:SH002616940000,8.0,8.0   
mean_median/part-00021:SH002616940000,14.0,14.0 
mean_median/part-00001:SH002616940000,3.0,3.0   
mean_median/part-00022:SH002616940000,524.666666667,26.0    
mean_median/part-00017:SH002616940000,65.0,65.0 
mean_median/part-00016:SH002616940000,1384.0,1384.0 
mean_median/part-00020:SH002616940000,596.0,68.0    
mean_median/part-00014:SH002616940000,51.0,51.0 
mean_median/part-00004:SH002616940000,6.0,6.0   
mean_median/part-00005:SH002616940000,7.0,7.0   

Any suggestions?

  • By default, Streaming uses tab as the delimiter. Have you set it to use comma? – Donald Miner Apr 01 '13 at 23:03
  • Yeah, I think so. I am using tokens = line.split(","), so it parses fine? – frazman Apr 01 '13 at 23:04
  • Not a dumb question at all, believe me. :) Any problem which needs to have an idea of global state (like mean/median) is not that straightforward to do in Hadoop. – Suman Apr 02 '13 at 19:43
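
Following up on the first comment above: Hadoop Streaming splits each mapper output line into key and value at the first tab character by default. Because the mapper here emits a comma-separated line with no tab, the entire "id,value" string is treated as the key, so the same id with different values can hash to different reducers. A minimal sketch of the change, reusing the question's tokens (the reducer would then split incoming lines on the tab instead of the comma):

# Emit a tab between key and value so Streaming partitions on the id alone
print '%s\t%s' % (tokens[0], tokens[1])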

1 Answer


I have answered the same problem on the hadoop-user mailing list as follows:

How many Reducers did you start for this job? If you start many Reducers, the job will produce multiple output files named part-*, and each part contains only the local mean and median values of that Reducer's partition.

There are two kinds of solutions:

1. Call setNumReduceTasks(1) to set the Reducer count to 1. The job will then produce a single output file, and each distinct key will produce only one mean and median value.
2. Look at org.apache.hadoop.examples.WordMedian in the Hadoop source code. It processes all of the output files produced by the multiple Reducers with a local function and produces the final result. A sketch of this local merge step follows below.
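
For a Streaming job, option 1 needs no code change; the reducer count can be set on the command line, e.g. with -D mapred.reduce.tasks=1. For option 2, here is a minimal local post-processing sketch in the spirit of WordMedian. It assumes the job is changed to emit the raw id,value pairs (for example, with an identity reducer), since per-partition means and medians cannot be combined into exact global ones; the mean_median/part-* path is taken from the question's output.

#!/usr/bin/env python
# Local merge step (a sketch, not the WordMedian code itself): reads
# every part-* file produced by the job, assuming each line is a raw
# "id,value" pair, and computes the exact per-id mean and median.
import glob
from collections import defaultdict

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    s = sorted(values)
    n = len(s)
    if not n % 2:
        return (s[n / 2] + s[n / 2 - 1]) / 2.0
    return s[n / 2]

data = defaultdict(list)
for path in glob.glob('mean_median/part-*'):  # output dir from the question
    with open(path) as f:
        for raw in f:
            try:
                key, value = raw.rstrip().split(',')
                data[key].append(float(value))
            except ValueError:
                continue  # skip malformed lines

for key, values in sorted(data.items()):
    print '%s,%s,%s' % (key, mean(values), median(values))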
