16

You work for Zynga, and want to count the number of currently active players for different games. Your web server handles pings from many different games and each user has a unique GUID. Must be able to query number of active users for one game at a time. Active users are those that gotten ping in last minute.

The log lines come in continuously into a web server:

10.1.12.13 - - "http://zynga.com/ping?guid=<guid>&game=<gameID>" -

What is the quickest/easiest way to count active users? Please suggest a 45 minutes answer with some code.


My version

// web server interface, every time ping comes in count() will be called
// void count(String gameId, String guid)
// int getNumberActivePlayers(String gameId)

struct Record{
  String gameID;
  String guid;
};

class PingStorage{
private:
  max_heap<long, Record> storage;
public:
  //    O(log(n))
  //  n = total number of elements in storage
  void count(String gameId, String guid){
    long currentTimeStamp = getUnixTimeStamp();
    Record rec ;
    rec.gameId = gameId;
    rec.guid = guid;
    storage.add(currentTimeStamp, rec);
  }
  //N = numner of records in last ,minutes in storage
  //O(N)
  int getNumberActivePlayers(String gameId){
    map<String, Set<string> > game2user;
    long tillTimeStamp = getUnixTimeStampNow() - 60;
    while(true){
      pair<long, Record> rec = storage.getMax(); //O(1)
      if(rec.first <= tillTimeStamp) break;  
      Set<String> temp = game2user[rec.gameid]; //O(1)
      temp.add(rec.userid); //O(log(N)) - O(1)
    }
    return game2user[gameID].size();
  }
};
moooeeeep
  • 31,622
  • 22
  • 98
  • 187
Master Yoda
  • 587
  • 3
  • 7

4 Answers4

7

Assuming this is a realtime solution, you can handle ping requests in O(1), generate the current player statistic in O(1), and using O(num_player) space by sacrificing some accuracy. The key is to discretize timings.

Overview

The basic idea is to represent discrete time intervals as objects and store in those objects the following property: the number of distinct players that pinged during this time interval that have not since pinged. To query the number of active users, calculate a weighted sum of the last x time intervals that comprise the last minute.

Details

First, select an acceptable temporal resolution. In this example, I choose 15-second intervals.

Maintain five PingInterval data structures to represent five of these intervals (spanning 1 more intervals than 1 minute). PingInterval contains one property: a counter. These PingIntervals are maintained in a PingMonitor. Each time a player pings, update a map within PingMonitor that maps each player to the current time interval. When you perform this mapping, take the following steps, which maintain the counts within the PingIntervals (according to the characteristics I described in the overview section).

  • If the player is already mapped to an interval and it is the current interval, do nothing.
  • Else, if the player is mapped to an interval that isn't the current interval,
    • decrement the old interval's count,
    • increment the current interval's count,
    • and map that player to that interval.
  • Else, if the player is not mapped to an interval at all,
    • increment the current interval's count,
    • map the player to the current interval.

(If the PingInterval to represent the current time doesn't exist yet, set the oldest PingInterval to null, create a new PingInterval in a thread-safe way, and then continue as normal.)

When you want to query the number of active users, calculate a time-passed-weighted sum of the last five intervals time intervals. For example, if you are only 5 seconds into the current time interval (meaning the next 10 seconds of that interval have not happened yet), calculate this value: 2/3 * oldest interval + sum of the 4 newest intervals.

Other thoughts

Five intervals is very conservative; we could scale up the number considerably for greater accuracy (maybe one per second) and it would still afford us a significant savings. The important part is that our times are now discrete intervals. This means that when we go to count the number of active users, we don't have to look at every individual time (which is equal to the number of users); instead we can look at x bins of time that we've predefined.

cheeken
  • 33,663
  • 4
  • 35
  • 42
  • shouldn't your complexity estimates for ping and query actually be O(num_bins) ? – moooeeeep Jun 15 '12 at 07:29
  • 1
    @moooeeeep Pinging does not loop over bins, so it is O(1). You are right, though, that query complexity is linearly proportional with the number of bins. However, since the number of bins is a pre-defined constant, I call its complexity constant time - especially since it's not really an "input" to the algorithm. – cheeken Jun 15 '12 at 10:17
  • I must have misread the thing on adding new entries. You keep creating new bins, instead of moving the GUID entries through the bins as time proceeds... – moooeeeep Jun 15 '12 at 10:37
  • This is not the solution. Lets say we pass you an array of gameid and you need to give number of active users for each gameid. How your solution would scale in that situation?? – Master Yoda Jun 17 '12 at 03:08
  • @MasterYoda The process I describe is a solution for a single game. For multiple games, just apply the solution multiple times. – cheeken Jun 17 '12 at 03:09
  • @cheeken suppose you don't know the number of games in advance. You need a data structure which keeps a map between gameid and number of active users. – Master Yoda Jun 17 '12 at 03:54
  • @MasterYoda I would maintain a hashtable between game GUID and PingMonitor. When a request comes in, I would grab the appropriage game's PingMonitor and invoke the operation on it. – cheeken Jun 17 '12 at 13:08
  • @cheeken Still your solution is not efficient enough. It means if we have 10000 gameid you keep no_of_time_interval*10000 TimeInterval Objects. Can you combine all the gameid in the same data structure. – Master Yoda Jun 18 '12 at 04:04
  • 1
    @MasterYoda What amount of space would be efficient enough? You cannot combine the data structures for individual games; if you did, then the player counts would be totals for all games, not individual games. – cheeken Jun 18 '12 at 04:20
3

My approach would be to use a deque (called queue in the remainder of this post) that all GUIDs are pushed to that are observed, i.e., which is thus ordered by age. In addition I would use a hashmap that contains the pointers to the entries of any GUID present in the queue.

When a new GUID is pushed to the queue, the old entry (if any) would be looked up in the hashmap, removed from the queue, and the new one assigned instead to the hashmap.

As time proceeds, all entries in the queue that surpass the age threshold, would to be popped out (and removed from the hashmap).

The length of the queue (a.k.a. number of active users) could be kept track of as a separate variable, to avoid hopping through the queue for every query.

To support multiple games just add such structure for every gameID.

Complexity: O(1) insert/deletion of observations (given a perfect hash, i.e. no collisions), O(1) query, O(n) space.

moooeeeep
  • 31,622
  • 22
  • 98
  • 187
  • Can you elaborate on how and when stale elements are removed from the queue? If you can do that in O(1), this sounds like the ideal solution. (But I think it might not be that easy!) – cheeken Jun 15 '12 at 10:19
  • @cheeken I assume that the timestamps always increase, therefore the dequeue is in order (oldest entry at the head, youngest at the back). Everytime an update or a query occurs, the GUIDs that are older than a minute will be popped out (O(1)), and removed from the hashmap (O(1), given a perfect hash). The counter is adapted accordingly. I think it's pretty straight forward. – moooeeeep Jun 15 '12 at 10:35
  • When a GUID is observed that already is in the queue, the previous entry is found via the hashmap, and removed from the dequeue (O(1)) before a new entry is pushed to the queue. – moooeeeep Jun 15 '12 at 10:40
  • I see, I assumed that was how you were doing it. But why is removing GUIDs that are older than a minute O(1)? The number of GUIDs you must remove is proportional to the number of players. Even if you can find where in the queue you want to start removing players and you just break that whole tail off, you still need to count the number of elements within it to update the count, right? Just making sure I'm not missing something here. – cheeken Jun 15 '12 at 10:50
  • @cheeken I meant just the deletion of each single GUID. The removal of the n_stale_guids GUIDs costs of course O(n_stale_guids). There is probably indeed room for improvement... – moooeeeep Jun 15 '12 at 11:01
  • @cheeken The operation of removing stale entries can happen at a 1-minute interval. So I think the entire algorithm is amortized to be O(1), if the pinging activities are intense enough. Also, looks like the entries in the queue are sorted by timestamp so we can use binary probe to determine the position where we want to chop the queue. Any thought? – KFL Jun 15 '12 at 20:31
  • @Kay For a realtime solution, purging entries at 1 minute intervals is not sufficient. Consider the state of the dequeue 1 second before a purge. At that time, the dequeue contains 1 minute 59 seconds' worth of entries. This is almost twice the value of the actual number of players. If a count were performed at this time, it would be almost twice the actual value. – cheeken Jun 15 '12 at 20:42
0

EDIT: I assumed this question wasn't about getting a real-time answer for the question "how many users are active now", but rather getting historical values - how many users were active at 3:25PM. I'm keeping the old solution below the new one:

So, you want to know how many users are active now, Keep a queue per game. Whenever you see a new log entry, find out which game it belongs to, and add it to the game's queue. After each addition clean up old entries at the beginning of the queue (all entries older than 1 minute at the time of the clean-up).

When asked for the number of active users in a game, do the same clean-up on the game's queue, and return the queue's depth.

Keep a hash mapping the games to the queues, and you got an O(N) operation with N being the number of lines in the log - each line is processed at most twice - once for adding it, once for removing it. You also do an additional comparison per addition and look-up (when deciding a queue entry is not old enough), but that's constant time times N. So O(N) in all.

Previous answer to that other question: Seeing there aren't all that many minutes (1440 a day), I would create a vector for each game, with a slot for each minute.

Go over the log file, for each line get the time, round it up to the closest minute and add 1 to the appropriate slot in the array. Once you're done, you'll know exactly how many active users were for each game at every minute.

Complexity - O(N), with N being the number of lines in the log file.

To support multiple games, just use a hash to map from the game's name to its vector.

Now, this assumes you only check for active users at whole minute boundries (1:00:00, 1:01:00, etc...). This is probably what you're required to do anyway.

zmbq
  • 38,013
  • 14
  • 101
  • 171
  • you are getting this wrong. Suppose that the query is made at 12:15:30, you would need to check both items on the 12:14 and 12:15, which can be log(n) in worst case. This is as good as the simplest algorithm of storing all values in one array and do a binary search on it. + simply using a counter wont give you the right answer. – Samy Arous Jun 14 '12 at 19:19
  • Actually, lookup is done in O(1) - 12:14 is cell number 14, 12:15 is cell number 15, 2:30 is cell number 150 (60 * hour + minute). – zmbq Jun 14 '12 at 19:21
  • what i mean is that you need to get all the items between 12:14:30 and 12:15:30, you can't just add up both cells (14,15). – Samy Arous Jun 14 '12 at 19:22
  • Indeed. But I assume he won't need to, he needs to say how many users were active each minute in the past, so you need to select boundries. If he just wants to know how many users are active *now*, it's a whole other question, that is also very simple - just maintain a queue for each game, pushing entries when you get the ping, and cleaning up too-old entries. – zmbq Jun 14 '12 at 19:24
  • It's the second case: "Must be able to query number of active users for one game at a time". and yes, that would be the solution I thought about. Using a queue. – Samy Arous Jun 14 '12 at 19:28
  • OK, I modified my answer to fit both cases. – zmbq Jun 14 '12 at 19:34
0

This would be my sequence of answers:

  1. Why bother? It's easiest to count minute-by-minute how many users are active. Isn't it enough to know that?
  2. If you really care about up-to-date information, let's count in second-by-second bins (as described by Cheeken). This will be accurate up to a fraction of a second.
  3. Ok, if real-time accuracy is "necessary", and you want to interview me about data structures, let's use a heap of customers scored by last activity time (as described by Master Yoda).
  4. If real-time accuracy is necessary, and we were to do this in production, let's use the data structure server Redis. We maintain a sorted set of customers scored by last activity time. We can query how many customers were active in the last minute or in the last hour with the zcount command. This is useful and reliable.
Colonel Panic
  • 132,665
  • 89
  • 401
  • 465