
I can't get my head around why this standard Python code produces an unexpected result when translated to MapReduce using mrjob.

Example data from a .txt file:

1  12
1  14
1  15
1  16
1  18
1  12
2  11
2  11
2  13
3  12
3  15
3  11
3  10

This code creates a dictionary and performs a simple division calculation:

dic = {}

with open('numbers.txt', 'r') as fi:
    for line in fi:
        parts = line.split()
        dic.setdefault(parts[0],[]).append(int(parts[1]))

print(dic)

for k, v in dic.items():
    print(k, 1/len(v), v)

Result:

{'1': [12, 14, 15, 16, 18, 12], '2': [11, 11, 13], '3': [12, 15, 11, 10]}

1 0.16666666666666666 [12, 14, 15, 16, 18, 12]
2 0.3333333333333333 [11, 11, 13]
3 0.25 [12, 15, 11, 10]

But when translated to MapReduce using mrjob:

from mrjob.job import MRJob
from mrjob.step import MRStep
from collections import defaultdict

class test(MRJob):

    def steps(self):
        return [MRStep(mapper=self.divided_vals)]

    def divided_vals(self, _, line):

        dic = {}
        parts = line.split() 
        dic.setdefault(parts[0],[]).append(int(parts[1]))

        for k, v in dic.items():
            yield (k, 1/len(v)), v 

if __name__ == '__main__': 
    test.run()

Result:

["2", 1.0]  [11]
["2", 1.0]  [13]
["3", 1.0]  [12]
["3", 1.0]  [15]
["3", 1.0]  [11]
["3", 1.0]  [10]
["1", 1.0]  [12]
["1", 1.0]  [14]
["1", 1.0]  [15]
["1", 1.0]  [16]
["1", 1.0]  [18]
["1", 1.0]  [12]
["2", 1.0]  [11]

Why doesn't MapReduce group and calculate in the same way? How do I recreate the standard Python result in MapReduce?

RDJ
  • Where is your reducer? – OneCricketeer Dec 04 '17 at 19:26
  • Logical question. Not got there yet. There's more work to do in the mapper first, but that's dependent on having the right data structure to work with. – RDJ Dec 04 '17 at 20:47
  • By default, if you split the lines of input, with the first column as the key, the reducer will get `(1, [12, 14, 15, 16, 18, 12])` for the `1` column data ... From there, you can get `1/len(values)` – OneCricketeer Dec 04 '17 at 21:34
  • Would you mind elaborating on this as an answer I can accept. Are you saying I don't need the Dictionary? – RDJ Dec 09 '17 at 19:10
  • Mapreduce replaces your need for a dictionary, yes. The values are reduced into a list for a given key. Did you try not using a dictionary? – OneCricketeer Dec 09 '17 at 20:28
  • This works, yes, but I need the division in the mapper stage for further calculations in the mapper stage. I don't want to do the division in the reducer. – RDJ Dec 10 '17 at 14:51
  • Okay, well, the values are combined until the reducer, so **one** mapper would need to read the entire file at least once, which really defeats the purpose of using mapreduce... You can perform multiple map&reduce stages to get your data into whatever format you need. All I'm saying is that your keys will already combined in the reducer – OneCricketeer Dec 10 '17 at 17:54
  • Yeah I follow that and appreciate your time. As the question’s been live for a week with no answers I think I’ll have to fundamentally rethink this. – RDJ Dec 10 '17 at 18:02
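The flow the comments describe can be sketched in plain Python (a stdlib-only simulation of map → shuffle → reduce; the function names `mapper`, `shuffle`, and `reducer` are illustrative, and in real mrjob the framework performs the grouping between stages for you):

```python
from collections import defaultdict

# Mapper: called once per input line, so it only ever sees one
# (key, value) pair at a time -- this is why the original mapper's
# local dict always has exactly one entry and 1/len(v) is always 1.0.
def mapper(line):
    key, value = line.split()
    yield key, int(value)

# Shuffle: the framework (not your code) groups values by key
# between the map and reduce stages.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

# Reducer: receives every value for a key, so len() is meaningful here
# and the division matches the single-machine result.
def reducer(key, values):
    yield (key, 1 / len(values)), values

lines = ["1 12", "1 14", "1 15", "1 16", "1 18", "1 12",
         "2 11", "2 11", "2 13",
         "3 12", "3 15", "3 11", "3 10"]

mapped = (pair for line in lines for pair in mapper(line))
for key, values in shuffle(mapped):
    for result in reducer(key, values):
        print(result)
```

This reproduces the single-machine output, e.g. `('1', 1/6)` with `[12, 14, 15, 16, 18, 12]`; in mrjob the same division would live in a reducer added to the `MRStep`.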

0 Answers