MRJob - Iterating over values

Question

Input (Name;Date;Spent):

Alice;01/01/2020;100
Alice;02/01/2020;30
Alice;24/01/2020;50
Bob;24/01/2020;1500
Bob;24/01/2020;12
Bob;25/01/2020;16
Bob;25/01/2020;83
Bob;25/01/2020;91
Alice;13/02/2020;10
Alice;25/02/2020;3

The output has to be the names of the people who bought on at least 5 different days. For this input that is only Alice, since Bob bought 5 times but on only 2 days.

My problem comes when I try to count the values. I already solved it using sets:

from mrjob.job import MRJob

class MR_ex2(MRJob):
    def mapper(self, _, line):
        person, day, spent_money = line.split(';')
        yield person, day

    def reducer(self, key, counts):
        counts = set(counts)
        if len(counts) >= 5:
            yield key, key

if __name__ == '__main__':
    MR_ex2.run()

But it crashes when I try to add a combiner, and I don't think this is the best way to do it. I searched for examples but couldn't find many. The ones that don't use MRJob keep a variable that stores the state of the current subproblem and iterate over that subproblem's values (Alice and Bob in this problem) to join them, but I don't know how to do that in MRJob.

So in short, my question is: what is the right way to solve this problem? Should I join all the distinct values of each key in the reducer and then check whether there are 5 or more?

Kind of fixed, knowing that I can do `values = [x for x in counts]` to extract all the values of each key. Now I can keep advancing — set92, Nov 30 '20 at 16:35
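
To illustrate the comment above: the values handed to a reducer or combiner behave like a one-shot iterator, so they have to be materialized (as a list or a set) before being read more than once. The generator below is a made-up stand-in with hypothetical data, not mrjob code.

```python
def values_for_key():
    """Stand-in for the one-shot iterator a reducer receives (hypothetical data)."""
    yield "01/01/2020"
    yield "02/01/2020"
    yield "01/01/2020"


values = values_for_key()
first_pass = [x for x in values]   # consumes the generator
second_pass = [x for x in values]  # nothing is left to read

# Deduplicate from the materialized list, not from the exhausted generator
distinct_days = set(first_pass)

print(len(first_pass), len(second_pass), len(distinct_days))  # prints: 3 0 2
```

This is why code that reads the values once and then tries to reuse them only sees an empty sequence the second time.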

score 0 · Answer 1 · answered Dec 01 '20 at 12:44

Probably not the best answer, but I'm posting it in case someone arrives here with a similar question. My way of partially solving it was to unroll the elements from the generator and then create the set. This avoids the error, although I'm not sure whether using a set in MapReduce is good practice.

The code would be:

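from mrjob.job import MRJob

class MR_ex2(MRJob):

    def mapper(self, _, line):
        persona, dia, dinero_gastado = line.split(';')
        # Wrap the day in a list so mapper and combiner output have the same shape
        yield persona, [dia]

    def combiner(self, key, values):
        # values is a one-shot generator of lists of days: flatten and deduplicate it once
        dias = set(item for sublist in values for item in sublist)
        # Do not filter here: a combiner only sees part of the days for a key.
        # Yield a list, since a set cannot be serialized as JSON.
        yield key, sorted(dias)

    def reducer(self, key, values):
        # Union the partial lists of days, then apply the threshold
        dias = set(item for sublist in values for item in sublist)
        if len(dias) >= 5:
            yield key, key

if __name__ == '__main__':
    MR_ex2.run()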
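
To check the distinct-days logic without an mrjob or Hadoop setup, here is a minimal sketch in plain Python that mimics the mapper, combiner and reducer stages. The function names (`map_line`, `combine`, `reduce_days`, `run_local`) and the chunking of the input are assumptions made for illustration; they are not part of mrjob. The combiner returns sorted lists rather than sets on the assumption that values travel between stages as JSON, which cannot encode a set.

```python
from collections import defaultdict

LINES = [
    "Alice;01/01/2020;100",
    "Alice;02/01/2020;30",
    "Alice;24/01/2020;50",
    "Bob;24/01/2020;1500",
    "Bob;24/01/2020;12",
    "Bob;25/01/2020;16",
    "Bob;25/01/2020;83",
    "Bob;25/01/2020;91",
    "Alice;13/02/2020;10",
    "Alice;25/02/2020;3",
]


def map_line(line):
    """Mapper stage: emit (person, [day]); the amount spent is not needed."""
    person, day, _spent = line.split(";")
    return person, [day]


def combine(pairs):
    """Combiner stage: merge the days seen for each person within one chunk.

    No threshold is applied here, because a chunk holds only part of a
    person's purchases.
    """
    days_by_person = defaultdict(set)
    for person, days in pairs:
        days_by_person[person].update(days)
    return [(person, sorted(days)) for person, days in days_by_person.items()]


def reduce_days(person, day_lists, min_days=5):
    """Reducer stage: union all partial day lists, then apply the threshold."""
    distinct = set()
    for days in day_lists:
        distinct.update(days)
    return person if len(distinct) >= min_days else None


def run_local(lines, chunk_size=5):
    """Hypothetical driver: split the input into chunks, combine each, then reduce."""
    grouped = defaultdict(list)
    for start in range(0, len(lines), chunk_size):
        chunk = lines[start:start + chunk_size]
        for person, days in combine(map_line(line) for line in chunk):
            grouped[person].append(days)
    results = []
    for person in sorted(grouped):
        name = reduce_days(person, grouped[person])
        if name is not None:
            results.append(name)
    return results


if __name__ == "__main__":
    print(run_local(LINES))  # prints: ['Alice']
```

The result does not depend on how the input is chunked, because the threshold is applied only after the reducer has merged every partial list for a person.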
