I was asked this question in an interview, and I'm not sure if I gave the proper answer, so I would like some insights.
The problem: There is a stream of users and items. At each minute, I receive a list of tuples (user, item), each representing that user u consumed item i. I need to find the top 100 popular items in the past hour, i.e., count how many users consumed each item and sort the items by that count. The trick is that, within the past hour, repeated consumptions of the same item by the same user count only once. The interviewer said I should think big: there would be millions of consumptions per hour. So he suggested a MapReduce job, or something else that can handle this amount of data every minute.
The solution I came up with: I said that I could maintain a list (or a matrix, if you prefer) of the consumed (user, item, timestamp) tuples, as if a time window were sliding over the stream. Something like:
- u1,i1,t1
- u1,i2,t1
- u2,i2,t2... and so on.
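For illustration only, this window could be held as a dict keyed by (user, item) with the latest timestamp as the value; this representation is my assumption, not something discussed in the interview:

    # Hypothetical in-memory layout of the sliding window:
    # one entry per (user, item) pair, keeping only the latest timestamp.
    window = {
        ("u1", "i1"): 1,  # timestamps shown as minute counters
        ("u1", "i2"): 1,
        ("u2", "i2"): 2,
    }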
At each minute, when I receive the stream of user-item consumptions for that minute, I first run a MapReduce job to update the time-window matrix with the current timestamp. This job could use two mappers (one over the stream, the other over the time-window list), and the reducer would simply keep the maximum timestamp for each pair. Pseudocode for what I described:
    mapTimeWindow(line):
        # time-window records already carry a timestamp
        user, item, timestamp = line.split(" ")
        context.write(key=(user, item), value=timestamp)

    mapStream(line):
        # fresh stream records are stamped with the current time
        user, item = line.split(" ")
        context.write(key=(user, item), value=now())

    reducer(key, values):
        # keep only the most recent timestamp per (user, item) pair
        context.write(key=key, value=max(values))
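To sanity-check the merge logic, here is a plain-Python sketch that simulates the two mappers and the reducer locally; the function name, the line formats, and the minute-granularity timestamps are all assumptions for illustration:

    import time

    def merge_window(window_lines, stream_lines):
        """Simulate the merge job: keep the newest timestamp per (user, item)."""
        now = int(time.time() // 60)  # current time in minutes (assumed granularity)
        latest = {}
        # mapTimeWindow: existing window entries carry their stored timestamp
        for line in window_lines:
            user, item, ts = line.split(" ")
            key = (user, item)
            latest[key] = max(latest.get(key, 0), int(ts))
        # mapStream: fresh stream entries are stamped with the current minute
        for line in stream_lines:
            user, item = line.split(" ")
            key = (user, item)
            latest[key] = max(latest.get(key, 0), now)
        # reducer: emit one line per pair with its newest timestamp
        return ["%s %s %d" % (u, i, ts) for (u, i), ts in latest.items()]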
Next, I run a second MapReduce job to calculate popularity, i.e., how many distinct users consumed each item in that list. The mapper reads the updated time-window list and emits (item, 1); the reducer sums the values for each item. Since every timestamp is stored, the mapper can check whether a consumption happened within the past hour. Pseudocode for the second job:
    mapPopularity(line):
        user, item, timestamp = line.split(" ")
        if timestamp < now() - 60:
            # older than one hour: drop it
            return
        # each (user, item) pair occurs only once in the window,
        # so this already counts distinct users
        context.write(key=item, value=1)

    reducerPopularity(key, values):
        context.write(key=key, value=sum(values))
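The same kind of local sketch for the popularity job, under the same assumptions about line format and minute-granularity timestamps:

    import time

    def popularity(window_lines):
        """Simulate the popularity job: count distinct users per item over the last hour."""
        now = int(time.time() // 60)
        counts = {}
        for line in window_lines:
            user, item, ts = line.split(" ")
            if int(ts) < now - 60:  # older than one hour: skip
                continue
            # each (user, item) pair occurs at most once in the window,
            # so incrementing by one already counts distinct users
            counts[item] = counts.get(item, 0) + 1
        return counts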
Finally, a third MapReduce job can read the result of the second one and extract the 100 items with the largest counts.
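Locally, that step reduces to a single top-k selection over the counts from the previous sketch, e.g. with heapq:

    import heapq

    def top_items(counts, k=100):
        """Return the k items with the highest distinct-user counts."""
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])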
My question: is this solution acceptable for the interview I had? It takes three MapReduce jobs to solve the problem, which seems like a lot to execute every minute; since the result has to be refreshed each minute, the whole pipeline cannot take longer than that. I put quite a lot of effort into it, but the interviewer gave me no feedback on whether it was right. I would like to know: is it possible to make it faster? Or is there another way to approach this (maybe not MapReduce)?