I am using a combiner along with a mapper and a reducer.
My mapper code is below:
#!/usr/bin/env python
import sys
import datetime

def main():
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 6:
            continue
        try:
            float(data[4])
        except:
            continue
        else:  # no exception
            sale_date = data[0]
            ymd = sale_date.split('-')
            date_obj = datetime.date(int(ymd[0]), int(ymd[1]), int(ymd[2]))
            print "{0}\t{1}".format(date_obj.weekday(), data[4])

main()
My reducer code is below:
#!/usr/bin/env python
# this reducer also acts as a combiner
import collections
import sys

def main():
    sales_counter = collections.defaultdict(int)
    sales_sum = collections.defaultdict(float)
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 3:  # acting as reducer
            sales_counter[data[0]] = sales_counter[data[0]] + int(data[1])
            sales_sum[data[0]] = sales_sum[data[0]] + float(data[2])
        elif len(data) == 2:  # acting as combiner
            sales_counter[data[0]] = sales_counter[data[0]] + 1
            sales_sum[data[0]] = sales_sum[data[0]] + float(data[1])
        else:
            continue  # invalid line read, ignore
    for key in sorted(sales_sum):
        print key,"\t",sales_counter[key],"\t",sales_sum[key]

main()
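The data file has six tab-separated fields per line (date, time, city, product category, amount, payment type). The first 10 lines are shown here, with the tabs rendered as spaces: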
2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex
2012-01-01 09:00 Fort Worth Women's Clothing 153.57 Visa
2012-01-01 09:00 San Diego Music 66.08 Cash
2012-01-01 09:00 Pittsburgh Pet Supplies 493.51 Discover
2012-01-01 09:00 Omaha Children's Clothing 235.63 MasterCard
2012-01-01 09:00 Stockton Men's Clothing 247.18 MasterCard
2012-01-01 09:00 Austin Cameras 379.6 Visa
2012-01-01 09:00 New York Consumer Electronics 296.8 Cash
2012-01-01 09:00 Corpus Christi Toys 25.38 Discover
2012-01-01 09:00 Fort Worth Toys 213.88 Visa
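To make the field layout concrete, here is a small sketch of what the mapper does to the first line above. The names `map_line` and `sample` are made up for this illustration, it assumes the fields are separated by single tabs, and it is written so that it also runs under Python 3, unlike my actual scripts.

```python
import datetime


def map_line(line):
    """Return 'weekday<TAB>amount' for a valid record, or None to skip it."""
    fields = line.strip().split("\t")
    if len(fields) != 6:
        return None  # wrong number of fields, ignore the line
    try:
        float(fields[4])
    except ValueError:
        return None  # amount is not numeric, ignore the line
    year, month, day = (int(part) for part in fields[0].split("-"))
    weekday = datetime.date(year, month, day).weekday()  # Monday is 0, Sunday is 6
    return "{0}\t{1}".format(weekday, fields[4])


# First sample line, with the tabs written out explicitly (an assumption
# about the real file, based on the mapper splitting on "\t").
sample = "2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex"
print(repr(map_line(sample)))
```

So each mapper output line carries a weekday number as the key and a single sale amount as the value.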
The result I got is as follows:
0 34034 8529272.78
0 567400 141834839.29
1 22715 5660345.68
1 566889 141586312.46
2 22611 5625669.74
2 555219 138745830.2
3 22666 5633975.27
3 567051 141719805.3
4 25365 6363847.75
4 563769 141051081.75
5 34131 8560716.09
5 555310 138849461.48
6 34071 8503163.7
6 567245 141793631.77
I was hoping to see only one entry for each key (the first column). Combining the partial results for each key does give the correct totals, but why are there partial results for each key in the first place?
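For reference, by "combining the partial results" I mean summing the count and the total of the rows that share a key. A minimal sketch of that check, using only the rows for keys 0 and 1 from the output above (`partial_rows` and `merge_partials` are names made up for this illustration, and it is written for Python 3):

```python
import collections

# Rows for keys 0 and 1, copied from the output above: (key, count, total).
partial_rows = [
    ("0", 34034, 8529272.78),
    ("0", 567400, 141834839.29),
    ("1", 22715, 5660345.68),
    ("1", 566889, 141586312.46),
]


def merge_partials(rows):
    """Sum the counts and totals of all rows that share the same key."""
    counts = collections.defaultdict(int)
    totals = collections.defaultdict(float)
    for key, count, total in rows:
        counts[key] += count
        totals[key] += total
    return {key: (counts[key], totals[key]) for key in sorted(counts)}


merged = merge_partials(partial_rows)
for key, (count, total) in merged.items():
    print("{0}\t{1}\t{2:.2f}".format(key, count, total))
```

This prints one line per key, which is the shape of output I expected from the reducer itself.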