9

Source: Google Interview Question

Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.

We have many large <string (url) -> int (visits)> maps.

Calculate a combined <string (url) -> int (sum of visits across all distributed maps)> map, and get the top ten entries in it.

Main constraint: The maps are too large to transmit over the network. Also can't use MapReduce directly.

I have now come across quite a few questions of this type, where processing needs to be done over a large distributed system, and I can't think of or find a suitable answer.

All I could think of is brute force, which in one way or another violates the given constraint.

Community
  • 1
  • 1
Spandan
  • 2,128
  • 5
  • 25
  • 37
  • Looks like map-reduce problem to me https://en.wikipedia.org/wiki/MapReduce – matcheek Jul 29 '13 at 15:38
  • Sounds like a map-reduce problem. Map the URIs of web pages, then the reduce should add up the hits for identical URIs and emit the pairs – noMAD Jul 29 '13 at 15:39
  • @noMAD : see update constraint – Spandan Jul 29 '13 at 15:39
  • @matcheek: see constraint. – Spandan Jul 29 '13 at 15:40
  • What do you mean by can't use map-reduce directly? – noMAD Jul 29 '13 at 15:40
  • link to original question -> http://www.glassdoor.com/Interview/Google-Interview-RVW2184032.htm – Spandan Jul 29 '13 at 15:41
  • @noMAD: can't say much about it; please see the link above. – Spandan Jul 29 '13 at 15:41
  • How about this: build a max heap of size 10 on each machine by traversing the whole map. Then transmit all these max heaps to one machine and build a new max heap of size 10 from them. This is just an abstract idea, but I think it should work – noMAD Jul 29 '13 at 15:47
  • 1
    possible duplicate of [Parallel top ten algorithm for distributed data](http://stackoverflow.com/questions/15613966/parallel-top-ten-algorithm-for-distributed-data) – Jim Mischel Jul 29 '13 at 15:50
  • 2
    @noMAD: That won't work. Imagine that the top 10 on each machine are unique, but the 11th most frequent on each machine is the same. So the most frequent overall (that 11th item on each machine) is not in the top 10 for any machine. Your solution would not report that item. – Jim Mischel Jul 29 '13 at 15:52
  • @JimMischel Good catch. Well, one way to counter that would be to build a max heap on one machine, use it to build a max heap on the next machine, and so on. It is a solution, but it feels very slow. – noMAD Jul 29 '13 at 15:59

2 Answers

14

The statement that you can't use map-reduce directly is a hint that the author of the question wants you to think about how map-reduce works, so we will simply mimic its actions:

  1. Pre-processing: let R be the number of servers in the cluster; give each server a unique id from 0, 1, 2, ..., R-1.
  2. (map) For each (string, visits) entry, send the tuple to the server with id hash(string) % R.
  3. (reduce) Once step 2 is done (simple control communication), produce the (string, count) pairs of the top 10 strings per server. Note that the tuples counted are those sent in step 2 to this particular server.
  4. (map) Each server sends its top 10 to a single server (let it be server 0). This is fine: there are only 10*R such records.
  5. (reduce) Server 0 yields the top 10 across the network.
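The steps above can be sketched as a single-process simulation (the function name and data layout here are illustrative, not from the original answer; Python's built-in `hash` stands in for the routing function, and in-memory dicts stand in for the R servers):

```python
from collections import Counter
from heapq import nlargest

def distributed_top10(server_logs, R):
    """Simulate the hash-partition scheme. server_logs is a list of
    per-server {url: visits} maps; R is the number of servers."""
    # Step 2 (map): route each (url, visits) tuple to server
    # hash(url) % R, so every URL's total lands on exactly one server.
    shards = [Counter() for _ in range(R)]
    for log in server_logs:
        for url, visits in log.items():
            shards[hash(url) % R][url] += visits

    # Step 3 (reduce): each server computes its local top 10. These
    # local counts are already global totals for the URLs it owns.
    local_tops = [nlargest(10, shard.items(), key=lambda kv: kv[1])
                  for shard in shards]

    # Steps 4-5: server 0 merges the (at most) 10*R candidates.
    merged = [kv for top in local_tops for kv in top]
    return nlargest(10, merged, key=lambda kv: kv[1])
```

Because each URL is counted in full on exactly one shard, the "item 11 on every machine" scenario from the comments cannot slip through: that item's total sits on a single server and competes there with complete counts.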

Notes:

  • The problem with this algorithm, as with most big-data algorithms that don't use a framework, is handling failing servers. MapReduce takes care of that for you.
  • The above algorithm translates to a two-phase map-reduce algorithm fairly straightforwardly.
murrekatt
  • 5,961
  • 5
  • 39
  • 63
amit
  • 175,853
  • 27
  • 231
  • 333
  • 3
    I don't understand how this addresses @JimMischel's scenario where items 1-10 are unique between all servers, but item 11 is the same on all servers, resulting in item 11 being the most frequent item. – RustyTheBoyRobot Jul 29 '13 at 16:29
  • 2
    @RustyTheBoyRobot Because at step 2 you make sure each item will be processed by only 1 server. The local top 10 sent by each server in step 4 makes sure that #11 will NOT appear on any of the other servers, and thus will never be better than the top 10. – amit Jul 29 '13 at 16:46
  • 1
    The question says "The maps are too large to transmit over the network" and your algorithm sends the whole data over the network. – Thomash Jul 29 '13 at 17:23
  • 1
    @Thomash "The maps are too large to transmit over the network" That usually means you don't want to send everything to everyone. You send all the data once, not to everyone. Also, you can reduce the number of sends by combining a partial count for each original server. – amit Jul 29 '13 at 18:05
  • +1. Actually, the number of records required for step 4 is not `10 * R`. Server 0 can ask for the top item from all servers and put them into a priority queue. Then remove the highest item and request the next item from the server that produced that highest item. Repeat that process. The maximum number of items required to transmit would be `R + 9`. Your approach adds a distributed merge to the approach I recommended in the linked duplicate question. Nice. – Jim Mischel Jul 29 '13 at 19:03
  • @amit: Awesome and brilliant way. Thanks. – Spandan Jul 29 '13 at 22:35
  • @amit Hi Amit, what is the difference between the solution you provided and the solution if we can use map reduce directly? – CSnerd Sep 22 '16 at 21:24
  • @CSnerd The answer already explicitly says how to do it with MR directly, use 2 MR phases accordingly. – amit Sep 23 '16 at 09:13
3

In the worst case, any algorithm that does not require transmitting the whole frequency table is going to fail. We can create a trivial case where the global top 10 all sit at the bottom of every individual machine's list.

If we assume that the frequencies of URIs follow Zipf's law, we can come up with effective solutions. One such solution follows.

Each machine sends its top-K elements; K depends solely on the bandwidth available. One master machine aggregates the frequencies and finds the 10th-highest frequency value "V10" (note that this is a lower bound: since the global top 10 may not be in the top K of every machine, the sums are incomplete).

In the next step, every machine sends a list of URIs whose local frequency is at least V10/M (where M is the number of machines). The union of all such lists is sent back to every machine. Each machine, in turn, sends back its frequencies for this particular list. The master aggregates these into the top-10 list.
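A single-process sketch of this two-round threshold protocol (the function name, data layout, and per-machine budget K are my own illustrative choices, not from the answer; the V10/M threshold works because any URI with a global total of at least V10 must, by pigeonhole, reach at least V10/M on some machine):

```python
from collections import Counter
from heapq import nlargest

def zipf_top10(machine_counts, K):
    """Two-round threshold protocol. machine_counts is a list of
    per-machine {url: visits} maps; K is the per-machine send budget."""
    M = len(machine_counts)

    # Round 1: each machine sends its local top K; the master sums
    # these partial counts and takes the 10th-highest sum as the
    # (incomplete, hence lower-bound) threshold V10.
    partial = Counter()
    for counts in machine_counts:
        for url, v in nlargest(K, counts.items(), key=lambda kv: kv[1]):
            partial[url] += v
    v10 = nlargest(10, partial.values())[-1]

    # Round 2: every machine reports each URI whose local count is at
    # least V10/M; the master then collects exact totals for the
    # candidate union and ranks them.
    candidates = {url
                  for counts in machine_counts
                  for url, v in counts.items() if v >= v10 / M}
    totals = Counter()
    for counts in machine_counts:
        for url in candidates:
            totals[url] += counts.get(url, 0)
    return nlargest(10, totals.items(), key=lambda kv: kv[1])
```

Unlike the hash-partition answer above, this never moves the full maps; only top-K lists, the candidate union, and the candidates' counts cross the network, at the price of relying on the Zipf assumption for V10 to be a useful bound.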

ElKamina
  • 7,747
  • 28
  • 43
  • I like your idea. Could you elaborate more? I am not sure I understood why V10/M is the threshold. – rops Oct 02 '13 at 17:33
  • @daniele In the worst case all the occurrences of a number can be equally divided on M machines. So V10/M threshold is used. – ElKamina Oct 02 '13 at 20:07
  • I guess you have a typo, you should say "is or more" like below statement: In the next step every machine sends a list of URIs whose frequency is V10/M or more (where M is the number of machines). – bjethwan Nov 12 '17 at 12:47
  • @ElKamina - I guess you have a typo; you should say "or more", as in this corrected statement: "In the next step every machine sends a list of URIs whose frequency is V10/M or more (where M is the number of machines)." There's another issue too: when you send the union of the final list of URLs back to all machines, they have to be smart enough not to add a frequency back in if it is less than V10/M. Let me know. Otherwise your solution is cool, and unlike amit's solution above it doesn't require processing all the URLs on all the machines. – bjethwan Nov 12 '17 at 12:58