
I am using a combiner along with a mapper and a reducer in a Hadoop Streaming job.

My mapper code is as follows:

#!/usr/bin/env python

import sys
import datetime

def main():
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 6:
            continue  # skip malformed records
        try:
            float(data[4])  # the sale amount must parse as a number
        except ValueError:
            continue
        else:  # no exception: emit <weekday> TAB <amount>
            sale_date = data[0]
            ymd = sale_date.split('-')
            date_obj = datetime.date(int(ymd[0]), int(ymd[1]), int(ymd[2]))
            print "{0}\t{1}".format(date_obj.weekday(), data[4])

main()
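
For reference, datetime.date.weekday() numbers Monday as 0 through Sunday as 6, so a record dated 2012-01-01 (a Sunday) is emitted under key 6. A quick interactive check (the amount 214.05 is just a sample value):

import datetime

d = datetime.date(2012, 1, 1)
print d.weekday()                               # 6, i.e. Sunday
print "{0}\t{1}".format(d.weekday(), "214.05")  # the line the mapper would emit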

My reducer code is as follows:

#!/usr/bin/env python
# this reducer also acts as a combiner
import collections
import sys

def main():

    sales_counter = collections.defaultdict(int)
    sales_sum = collections.defaultdict(float)

    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 3:  # acting as reducer: <key> <count> <partial sum>
            sales_counter[data[0]] = sales_counter[data[0]] + int(data[1])
            sales_sum[data[0]] = sales_sum[data[0]] + float(data[2])
        elif len(data) == 2:  # acting as combiner: <key> <amount>
            sales_counter[data[0]] = sales_counter[data[0]] + 1
            sales_sum[data[0]] = sales_sum[data[0]] + float(data[1])
        else:
            continue  # invalid line read, ignore

    for key in sorted(sales_sum):
        print key, "\t", sales_counter[key], "\t", sales_sum[key]

main()
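
Without a cluster, the full map → combine → reduce path can be approximated locally by feeding the mapper output through the reducer script twice, sorting in between. A minimal sketch, assuming the two scripts are saved as mapper.py and reducer.py and the input is purchases.txt (all three file names are placeholders):

#!/usr/bin/env python
# local_test.py - rough local stand-in for map -> sort -> combine -> sort -> reduce
import subprocess

def run(script, text):
    # feed text to the script's stdin and capture its stdout
    p = subprocess.Popen(["python", script],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = p.communicate(text)
    return out

raw = open("purchases.txt").read()
mapped = run("mapper.py", raw)
# sort lines between phases, roughly as the Hadoop shuffle would
combined = run("reducer.py", "".join(sorted(mapped.splitlines(True))))
reduced = run("reducer.py", "".join(sorted(combined.splitlines(True))))
print reduced

Note that on a real cluster the combiner is an optimization that Hadoop may run zero, one, or several times per map task, so the reducer can legitimately receive a mix of raw two-field lines and combined three-field lines; the simulation above always combines exactly once.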

And the data file format is as follows (tab-separated; only the first 10 lines are shown):

2012-01-01  09:00   San Jose    Men's Clothing  214.05  Amex
2012-01-01  09:00   Fort Worth  Women's Clothing    153.57  Visa
2012-01-01  09:00   San Diego   Music   66.08   Cash
2012-01-01  09:00   Pittsburgh  Pet Supplies    493.51  Discover
2012-01-01  09:00   Omaha   Children's Clothing 235.63  MasterCard
2012-01-01  09:00   Stockton    Men's Clothing  247.18  MasterCard
2012-01-01  09:00   Austin  Cameras 379.6   Visa
2012-01-01  09:00   New York    Consumer Electronics    296.8   Cash
2012-01-01  09:00   Corpus Christi  Toys    25.38   Discover
2012-01-01  09:00   Fort Worth  Toys    213.88  Visa

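A quick check that one of these lines, typed out with explicit tabs (the runs of spaces above are tab characters in the actual file), yields the six fields the mapper expects:

line = "2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex"
fields = line.strip().split("\t")
print len(fields)            # 6
print fields[0], fields[4]   # sale date and amount, the two fields the mapper uses
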
The result I got is as follows:

0   34034   8529272.78
0   567400  141834839.29
1   22715   5660345.68
1   566889  141586312.46
2   22611   5625669.74
2   555219  138745830.2
3   22666   5633975.27
3   567051  141719805.3
4   25365   6363847.75
4   563769  141051081.75
5   34131   8560716.09
5   555310  138849461.48
6   34071   8503163.7
6   567245  141793631.77

I was expecting to see only one entry per key (the first column). The correct result can indeed be obtained by combining the two partial rows for each key, but my question is: why are there partial results for each key at all?
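
One way to verify whether the two rows per weekday really share an identical key is to print each key with repr() in the reducer's final loop. repr() makes otherwise invisible differences visible; the standalone sketch below uses a hypothetical trailing-space key to show the effect:

import collections

sales_counter = collections.defaultdict(int)
sales_counter['0'] += 1    # key as emitted by the mapper
sales_counter['0 '] += 1   # hypothetical key with a trailing space
for key in sorted(sales_counter):
    # repr() exposes hidden characters that a bare print would hide
    print repr(key), "\t", sales_counter[key]

Also worth checking in this context: the Python 2 print statement used in the reducer inserts a space between comma-separated items, so its output lines contain " \t " rather than a bare tab, and a key re-read from such a line with split("\t") comes back as "0 " rather than "0".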

Ankur Agarwal
  • Are you running on a single computer? Which path are you testing, the combiner or the reducer? – Picarus Oct 11 '15 at 23:27
  • @Picarus That is correct; it is actually a single virtual machine. – Ankur Agarwal Oct 11 '15 at 23:28
  • I think the problem is in the way you generate the key: somehow you get two values for what you expect to be the same key. Data cleaning is the most sensitive and usually the most time-consuming part of this kind of processing. I cannot be more specific without seeing the data. Can you maybe generate a smaller dataset that reproduces the error? What does ymd[2] contain? Why do you convert data[4] back to text instead of building another data structure that takes advantage of the processing you have already done? – Picarus Oct 11 '15 at 23:42
  • @Picarus ymd contains the year, month, and day: the list produced by splitting column 1 of the input file on '-' (hyphen). data[4] is only tested for conversion to float, and that is all it is used for; I need to ignore lines where it cannot be converted. I am not sure what you mean about creating a data structure to reuse data[4]. I also printed the types of all the keys, and they all came out as str. I think it has something to do with the way Hadoop mappers and combiners work. – Ankur Agarwal Oct 12 '15 at 00:02
  • My comment refers to the fact that you convert from string to float and back to string, which is expensive, but performance is not the key issue here. I would keep looking at your key-generation process; use an int instead of a string. How do you run your Hadoop job? – Picarus Oct 12 '15 at 00:17
  • @Picarus OK, I will try with the key as an int. – Ankur Agarwal Oct 12 '15 at 00:42

0 Answers