Source: Google Interview Question
Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.
Have many large <string (url) -> int (visits)> maps
.
Calculate < string (url) -> int (sum of visits among all distributed maps)
, and get the top ten in the combined map.
Main constraint: The maps are too large to transmit over the network. Also can't use MapReduce directly.
I have now come across quite a few questions of this type, where processiong needs to be done over large Distributed systems. I cant think or find a suitable answer.
All I could think of is brute force, which in some or other way, violates the given constraint.