Python: How can I index in MapReduce(MRJob)?

Question

I want to number the reducer output sequentially, like this:

1   "EZmocAborM6z66rTzeZxzQ"
2   "FIk4lQQu1eTe2EpzQ4xhBA"
3   "myql3o3x22_ygECb8gVo7A"
4   "ojovtd9c8GIeDiB8e0mq2w"
5   "uVEoZmmL9yK0NMgadLL0CQ"

My Python MRJob code:

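import json

from mrjob.job import MRJob


class MRUserDic(MRJob):
    # Running counter used to number the keys.
    count = 1

    def mapper(self, _, line):
        # Each input line is a JSON record with a 'user_id' field.
        line = json.loads(line)
        yield line['user_id'], 1

    def reducer(self, key, values):
        # Emit the current counter value for each distinct user_id.
        yield self.count, key
        self.count += 1

if __name__ == '__main__':
    MRUserDic.run()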

But this results in:

1   "EZmocAborM6z66rTzeZxzQ"
2   "FIk4lQQu1eTe2EpzQ4xhBA"
3   "myql3o3x22_ygECb8gVo7A"
1   "ojovtd9c8GIeDiB8e0mq2w"
2   "uVEoZmmL9yK0NMgadLL0CQ"

I know this happens because the reducers run on different machines, each with its own copy of count.

Is there any way to share the count variable among reducers?

score 0 · Answer 1 · answered Apr 12 '17 at 15:02

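Each reducer task runs in its own process with its own copy of count, so the counter can't be shared between reducers. Collect the reducer output after the job finishes instead. To sort it, you'll have to load the results into memory, which can be done using a runner. Put your job in its own .py file (MRUserDic.py) and write a separate driver script that runs the job and sorts the reducer output: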

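from MRUserDic import MRUserDic

reducer_output = []
mr_job = MRUserDic(args=['input_file.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    # mrjob 0.6+: parse_output() decodes cat_output() into (key, value) pairs.
    # Older releases used runner.stream_output() with mr_job.parse_output_line(line).
    for count, user_id in mr_job.parse_output(runner.cat_output()):
        # Keep only the user_id; the per-reducer count is not globally unique.
        reducer_output.append(user_id)

# Sort once the job has finished and all output is in memory.
sorted_output = sorted(reducer_output)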

Just replace 'input_file.txt' with the location of your input file.
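
Once the reducer output is in memory, the numbering itself can be done in the driver. Here is a minimal sketch, assuming the output has been collected as (count, user_id) pairs; index_keys is a hypothetical helper name, not part of mrjob, and the sample pairs are the ones from the question:

```python
def index_keys(reducer_pairs):
    """Assign one global 1-based index to each distinct key.

    reducer_pairs: iterable of (local_count, key) pairs collected from
    the job output. The per-reducer counts are discarded.
    """
    # Deduplicate and sort the keys so the numbering is deterministic.
    keys = sorted({key for _, key in reducer_pairs})
    return list(enumerate(keys, start=1))


# Sample pairs as produced by two reducers with separate counters.
pairs = [
    (1, "EZmocAborM6z66rTzeZxzQ"),
    (2, "FIk4lQQu1eTe2EpzQ4xhBA"),
    (3, "myql3o3x22_ygECb8gVo7A"),
    (1, "ojovtd9c8GIeDiB8e0mq2w"),
    (2, "uVEoZmmL9yK0NMgadLL0CQ"),
]

indexed = index_keys(pairs)
for index, key in indexed:
    print(index, key)
```

This only works if the full set of keys fits in memory on the machine running the driver.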
