Input (Name;Date;Spent):
Alice;01/01/2020;100
Alice;02/01/2020;30
Alice;24/01/2020;50
Bob;24/01/2020;1500
Bob;24/01/2020;12
Bob;25/01/2020;16
Bob;25/01/2020;83
Bob;25/01/2020;91
Alice;13/02/2020;10
Alice;25/02/2020;3
The output has to be the names of the people who bought on at least 5 different days. So for this input the answer is only Alice, since Bob bought 5 times but on only 2 different days.
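To make the expected result concrete, here is a plain-Python sketch (no MapReduce involved) that computes it for the sample input above. The names `SAMPLE` and `frequent_buyers` are made up for this illustration only.

```python
from collections import defaultdict

# The sample input from above, one record per line (Name;Date;Spent).
SAMPLE = """Alice;01/01/2020;100
Alice;02/01/2020;30
Alice;24/01/2020;50
Bob;24/01/2020;1500
Bob;24/01/2020;12
Bob;25/01/2020;16
Bob;25/01/2020;83
Bob;25/01/2020;91
Alice;13/02/2020;10
Alice;25/02/2020;3"""


def frequent_buyers(text, min_days=5):
    """Return the sorted names of people who bought on at least min_days distinct days."""
    days_by_person = defaultdict(set)
    for line in text.splitlines():
        person, day, _spent = line.split(';')
        days_by_person[person].add(day)  # a set ignores repeated days
    return sorted(p for p, days in days_by_person.items() if len(days) >= min_days)


print(frequent_buyers(SAMPLE))
```

Bob has five purchases but only two distinct dates, so he is filtered out, which is the behaviour the MapReduce job has to reproduce.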
My problem comes when I try to count the distinct values. I already have a working solution that uses sets:
from mrjob.job import MRJob


class MR_ex2(MRJob):

    def mapper(self, _, line):
        person, day, spent_money = line.split(';')
        yield person, day

    def reducer(self, key, counts):
        counts = set(counts)
        if len(counts) >= 5:
            yield key, key


if __name__ == '__main__':
    MR_ex2.run()
However, it crashes when I try to add a combiner, and I don't think this is the best way to do it. I searched for examples but couldn't find many. The ones that don't use MRJob keep a variable holding the state of the current subproblem and iterate over that subproblem's values (Alice and Bob in this problem) to merge them, but I don't know how to apply that approach in MRJob.
In short, my question is: what is the right way to solve this problem? Should I collect all the distinct values for each key in the reducer and then check whether there are 5 or more?
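For reference, here is a minimal sketch of one way a combiner could fit into this job. The failing combiner is not shown, so the cause of the crash is only a guess: a combiner that emits something of a different shape than the mapper (for example a count, or a `set`, which the default JSON-based internal protocol cannot serialize) would likely break the reducer. The sketch keeps the combiner output in the same `(person, day)` shape as the mapper output, so the reducer gives the same answer whether or not the combiner runs. The three steps are written as plain functions so they can be run without a cluster; `run_local` and `LINES` are hypothetical helpers that only imitate the shuffle and are not part of mrjob. In an MRJob class the same bodies would go into `mapper(self, _, line)`, `combiner(self, person, days)` and `reducer(self, person, days)`.

```python
from collections import defaultdict

LINES = [
    "Alice;01/01/2020;100", "Alice;02/01/2020;30", "Alice;24/01/2020;50",
    "Bob;24/01/2020;1500", "Bob;24/01/2020;12", "Bob;25/01/2020;16",
    "Bob;25/01/2020;83", "Bob;25/01/2020;91",
    "Alice;13/02/2020;10", "Alice;25/02/2020;3",
]


def mapper(line):
    # Same split as in the job above: Name;Date;Spent
    person, day, _spent = line.split(';')
    yield person, day


def combiner(person, days):
    # Drop duplicate days seen by this mapper, then re-emit plain
    # (person, day) pairs, i.e. the same shape the mapper produces.
    for day in set(days):
        yield person, day


def reducer(person, days):
    # Duplicates can still arrive from different mappers, so dedupe again.
    distinct_days = set(days)
    if len(distinct_days) >= 5:
        yield person, len(distinct_days)


def run_local(lines, n_splits=2):
    """Hypothetical driver: imitates map -> combine -> shuffle -> reduce."""
    shuffled = defaultdict(list)
    for i in range(n_splits):
        local = defaultdict(list)  # output of one simulated mapper task
        for line in lines[i::n_splits]:
            for person, day in mapper(line):
                local[person].append(day)
        for person, days in local.items():
            for key, day in combiner(person, days):
                shuffled[key].append(day)
    result = {}
    for person, days in shuffled.items():
        for key, value in reducer(person, days):
            result[key] = value
    return result


print(run_local(LINES))
```

The reducer still builds a set because a combiner only sees the output of a single mapper task and may not run at all, so it can reduce the data sent over the network but cannot replace the final deduplication.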