I was asked this question in an interview, and I'm not sure if I gave the proper answer, so I would like some insights.
The problem: There is a stream of users and items. At each minute, I receive a list of tuples (user, item), each representing that user u consumed item i. I need to find the top 100 popular items in the past hour, i.e., count how many users consumed each item and sort the items by that count. The trick is that, within the past hour, repeated consumptions of the same item by the same user count only once. The interviewer said I should think big: there would be millions of consumptions per hour. So he suggested a MapReduce job, or something else that can handle this amount of data every minute.
The solution I came up with: I said that I could maintain a list (or a matrix, if you prefer) of the consumed (user, item, timestamp) tuples, as if a time window were sliding over the stream. Something like:
- u1,i1,t1
- u1,i2,t1
- u2,i2,t2... and so on.
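For illustration only, this window could be held as a dict keyed by (user, item) with the latest timestamp as the value; this representation is my assumption, not something discussed in the interview:

    # Hypothetical in-memory layout of the sliding window:
    # one entry per (user, item) pair, keeping only the latest timestamp.
    window = {
        ("u1", "i1"): 1,  # timestamps shown as minute counters
        ("u1", "i2"): 1,
        ("u2", "i2"): 2,
    }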
At each minute, when I receive the stream of user-item consumptions for that minute, I first run a MapReduce job to update the time-window matrix with the current timestamp. This job could use two mappers (one over the stream, the other over the time-window list), and the reducer would simply keep the maximum timestamp for each pair. Pseudocode for what I described:
    mapTimeWindow(line):
        # time-window records already carry a timestamp
        user, item, timestamp = line.split(" ")
        context.write(key=(user, item), value=timestamp)

    mapStream(line):
        # fresh stream records are stamped with the current time
        user, item = line.split(" ")
        context.write(key=(user, item), value=now())

    reducer(key, values):
        # keep only the most recent timestamp per (user, item) pair
        context.write(key=key, value=max(values))
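To sanity-check the merge logic, here is a plain-Python sketch that simulates the two mappers and the reducer locally; the function name, the line formats, and the minute-granularity timestamps are all assumptions for illustration:

    import time

    def merge_window(window_lines, stream_lines):
        """Simulate the merge job: keep the newest timestamp per (user, item)."""
        now = int(time.time() // 60)  # current time in minutes (assumed granularity)
        latest = {}
        # mapTimeWindow: existing window entries carry their stored timestamp
        for line in window_lines:
            user, item, ts = line.split(" ")
            key = (user, item)
            latest[key] = max(latest.get(key, 0), int(ts))
        # mapStream: fresh stream entries are stamped with the current minute
        for line in stream_lines:
            user, item = line.split(" ")
            key = (user, item)
            latest[key] = max(latest.get(key, 0), now)
        # reducer: emit one line per pair with its newest timestamp
        return ["%s %s %d" % (u, i, ts) for (u, i), ts in latest.items()]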
Next, I run a second MapReduce job to calculate popularity, i.e., how many distinct users consumed each item in that list. The mapper reads the updated time-window list and emits (item, 1); the reducer sums the values for each item. Since every timestamp is stored, the mapper can check whether a consumption happened within the past hour. Pseudocode for the second job:
    mapPopularity(line):
        user, item, timestamp = line.split(" ")
        if timestamp < now() - 60:
            # older than one hour: drop it
            return
        # each (user, item) pair occurs only once in the window,
        # so this already counts distinct users
        context.write(key=item, value=1)

    reducerPopularity(key, values):
        context.write(key=key, value=sum(values))
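The same kind of local sketch for the popularity job, under the same assumptions about line format and minute-granularity timestamps:

    import time

    def popularity(window_lines):
        """Simulate the popularity job: count distinct users per item over the last hour."""
        now = int(time.time() // 60)
        counts = {}
        for line in window_lines:
            user, item, ts = line.split(" ")
            if int(ts) < now - 60:  # older than one hour: skip
                continue
            # each (user, item) pair occurs at most once in the window,
            # so incrementing by one already counts distinct users
            counts[item] = counts.get(item, 0) + 1
        return counts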
Finally, a third MapReduce job can read the result of the second one and extract the 100 items with the largest counts.
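Locally, that step reduces to a single top-k selection over the counts from the previous sketch, e.g. with heapq:

    import heapq

    def top_items(counts, k=100):
        """Return the k items with the highest distinct-user counts."""
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])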
My question: is this solution acceptable for the interview I had? It takes three MapReduce jobs to solve the problem, which seems like a lot to execute every minute; since the result has to be refreshed each minute, the whole pipeline cannot take longer than that. I put quite a lot of effort into it, but the interviewer gave me no feedback on whether it was right. I would like to know: is it possible to make it faster? Or is there another way to approach this (maybe not MapReduce)?