
I can't get my head around why this standard Python code produces an unexpected result when translated to MapReduce using mrjob.

Example data from a .txt file:

1  12
1  14
1  15
1  16
1  18
1  12
2  11
2  11
2  13
3  12
3  15
3  11
3  10

This code creates a dictionary and performs a simple division calculation:

dic = {}

with open('numbers.txt', 'r') as fi:
    for line in fi:
        parts = line.split()
        dic.setdefault(parts[0],[]).append(int(parts[1]))

print(dic)

for k, v in dic.items():
    print(k, 1/len(v), v)

Result:

{'1': [12, 14, 15, 16, 18, 12], '2': [11, 11, 13], '3': [12, 15, 11, 10]}

1 0.16666666666666666 [12, 14, 15, 16, 18, 12]
2 0.3333333333333333 [11, 11, 13]
3 0.25 [12, 15, 11, 10]

But when translated to MapReduce using mrjob:

from mrjob.job import MRJob
from mrjob.step import MRStep
from collections import defaultdict

class test(MRJob):

    def steps(self):
        return [MRStep(mapper=self.divided_vals)]

    def divided_vals(self, _, line):

        dic = {}
        parts = line.split() 
        dic.setdefault(parts[0],[]).append(int(parts[1]))

        for k, v in dic.items():
            yield (k, 1/len(v)), v 

if __name__ == '__main__': 
    test.run()

Result:

["2", 1.0]  [11]
["2", 1.0]  [13]
["3", 1.0]  [12]
["3", 1.0]  [15]
["3", 1.0]  [11]
["3", 1.0]  [10]
["1", 1.0]  [12]
["1", 1.0]  [14]
["1", 1.0]  [15]
["1", 1.0]  [16]
["1", 1.0]  [18]
["1", 1.0]  [12]
["2", 1.0]  [11]

Why doesn't MapReduce group and calculate in the same way? How do I recreate the standard Python result in MapReduce?

RDJ
  • Where is your reducer? – OneCricketeer Dec 04 '17 at 19:26
  • Logical question. Not got there yet. There's more work to do in the mapper first, but that's dependent on having the right data structure to work with. – RDJ Dec 04 '17 at 20:47
  • By default, if you split the lines of input, with the first column as the key, the reducer will get `(1, [12, 14, 15, 16, 18, 12])` for the `1` column data ... From there, you can get `1/len(values)` – OneCricketeer Dec 04 '17 at 21:34
  • Would you mind elaborating on this as an answer I can accept. Are you saying I don't need the Dictionary? – RDJ Dec 09 '17 at 19:10
  • Mapreduce replaces your need for a dictionary, yes. The values are reduced into a list for a given key. Did you try not using a dictionary? – OneCricketeer Dec 09 '17 at 20:28
  • This works, yes, but I need the division in the mapper stage for further calculations in the mapper stage. I don't want to do the division in the reducer. – RDJ Dec 10 '17 at 14:51
  • Okay, well, the values are combined until the reducer, so **one** mapper would need to read the entire file at least once, which really defeats the purpose of using mapreduce... You can perform multiple map&reduce stages to get your data into whatever format you need. All I'm saying is that your keys will already combined in the reducer – OneCricketeer Dec 10 '17 at 17:54
  • Yeah I follow that and appreciate your time. As the question’s been live for a week with no answers I think I’ll have to fundamentally rethink this. – RDJ Dec 10 '17 at 18:02
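The flow the comments describe can be sketched in plain Python (a stdlib-only simulation of map → shuffle → reduce; the function names `mapper`, `shuffle`, and `reducer` are illustrative, and in real mrjob the framework performs the grouping between stages for you):

```python
from collections import defaultdict

# Mapper: called once per input line, so it only ever sees one
# (key, value) pair at a time -- this is why the original mapper's
# local dict always has exactly one entry and 1/len(v) is always 1.0.
def mapper(line):
    key, value = line.split()
    yield key, int(value)

# Shuffle: the framework (not your code) groups values by key
# between the map and reduce stages.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

# Reducer: receives every value for a key, so len() is meaningful here
# and the division matches the single-machine result.
def reducer(key, values):
    yield (key, 1 / len(values)), values

lines = ["1 12", "1 14", "1 15", "1 16", "1 18", "1 12",
         "2 11", "2 11", "2 13",
         "3 12", "3 15", "3 11", "3 10"]

mapped = (pair for line in lines for pair in mapper(line))
for key, values in shuffle(mapped):
    for result in reducer(key, values):
        print(result)
```

This reproduces the single-machine output, e.g. `('1', 1/6)` with `[12, 14, 15, 16, 18, 12]`; in mrjob the same division would live in a reducer added to the `MRStep`.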

0 Answers