Approximate count of events over given time frame

Question

I would like to have an efficient way of calculating the (approximate) count of a recurring event over a given time frame.

Example: I am trying to repeatedly download a file from a host. It usually works fine, but sometimes an error happens when the network is congested. I don't care about these single errors. Every once in a while though, the host is offline, so all my attempts fail. In that case I would like to automatically stop my program from trying again.

So I need to find out how many errors occurred over the last x minutes. When the number is below a certain threshold, nothing happens. When it is above, I want to take an action. The count does not have to be 100% accurate, only accurate enough to tell me whether the threshold was reached.

A simple, yet inefficient (O(n)), way of doing this would be to just store the timestamps of the events, and then for every new event determine the number of previous events by iterating over them and comparing the timestamps (until the start of the time frame is reached). [aside] I imagine this is what SQL engines do for a WHERE timestamp BETWEEN NOW() - INTERVAL X MINUTE AND NOW(), unless they have an index on the column. [/aside]

I want a solution with constant (O(1)) complexity. So far I am thinking that I will keep a counter that increases by 1 with every event. I will also store the timestamp of the most recent occurrence. Then, when a new event happens, by some math magic I can decrease the counter using the current time and the stored timestamp to reflect approximately how many events happened over the last x minutes.

Unfortunately my math skills are not up to the task. Can someone provide some hints?

Related - [Design a datastructure to return the number of connections to a web server in last 1 minute](http://stackoverflow.com/a/18396955/1711796). You can use the queue-based approach if the interval is fixed. If there are a small number of options for the interval, you can have multiple pointers into the queue, one for each interval. Or the count-based approach should work. — Bernhard Barker, Sep 04 '13 at 15:03
Is that "X minutes" a constant for a particular run of the program? Or will you sometimes want to know how many errors occurred in the last 10 minutes, and other times want to know how many errors occurred in the last 30 minutes? — Jim Mischel, Sep 04 '13 at 15:41
x is a constant. There is however the need to keep track of different types of events over individually different time frames. — theintz, Sep 04 '13 at 15:43
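
For reference, the queue-based approach mentioned in the first comment might look roughly like the sketch below. It is only an illustration under the assumption of a fixed window; the class name `SlidingWindowCounter` and the explicit `now` argument are made up for this example. The count is exact, each timestamp is appended and removed at most once (amortized O(1) per event), but memory grows with the number of events inside the window.

```python
from collections import deque


class SlidingWindowCounter:
    """Exact count of events in the last `window` seconds (hypothetical helper)."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # event timestamps, oldest on the left

    def record(self, now):
        """Remember that an event happened at time `now`."""
        self.events.append(now)

    def count(self, now):
        """Drop timestamps that left the window, then return how many remain."""
        cutoff = now - self.window
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)
```

Passing the time in explicitly keeps the sketch easy to test; a real program would likely pass `time.time()`.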

score 2 · Answer 1 · answered Sep 04 '13 at 16:26

If you're just going to threshold the failure count within the last x minutes, why not store failure timestamps in a circular buffer of capacity equal to the threshold? Inserts are clearly O(1), and to check whether there have been enough failures, test whether the least recently inserted timestamp is within the last x minutes.

David Eisenstat

I don't understand how that would reflect the threshold. Let me outline my understanding of the approach: I have a buffer and a second variable `$index` holding the position of the most recent insert. I need the index variable in order to do efficient inserts. In order to get the most recently inserted timestamp I can just get the field at `$index - 1`, but that does not tell me anything about whether the buffer had been completely filled before (as required per the threshold). – theintz Sep 04 '13 at 17:23
@t.heintz Testing whether there have been F failures in the last x minutes is the same as testing whether the Fth most recent failure was at most x minutes ago. You want to examine the field at `($index + 1) % F`, which will be overwritten by the next failure. – David Eisenstat Sep 04 '13 at 17:40
Ah I get it, I misread "least recently" for "most recently". Then it makes perfect sense of course! Very elegant. The only problem is that I might want to have a high threshold, which requires keeping a large buffer. – theintz Sep 04 '13 at 18:49
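
A minimal sketch of the circular-buffer idea discussed above, assuming timestamps in seconds. The class name `FailureRing` and its method are hypothetical, and `index` here points at the slot the next failure will overwrite, which is the least recently inserted timestamp once the buffer is full.

```python
class FailureRing:
    """Detect `threshold` failures within `window` seconds using O(1) work per failure."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.slots = [None] * threshold  # one slot per failure up to the threshold
        self.index = 0  # slot that the next failure will overwrite

    def record_failure(self, now):
        """Store a failure; return True if the last `threshold` failures fit in the window."""
        self.slots[self.index] = now
        self.index = (self.index + 1) % self.threshold
        oldest = self.slots[self.index]  # least recently inserted timestamp
        # None means the buffer has not been filled yet, so the threshold cannot be reached.
        return oldest is not None and now - oldest <= self.window
```

As noted in the last comment, memory use is proportional to the threshold, so a very high threshold means a correspondingly large buffer.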

score 1 · Answer 2 · answered Sep 04 '13 at 16:20

A simple way to solve this is to have a counter that you increase by one for each error and reset to zero for each successful download. This keeps track of how many downloads have failed in a row, and might be enough to solve your problem.
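
As a rough illustration of that counter (the class name `ConsecutiveFailureCounter` and the `limit` parameter are assumptions made for this sketch):

```python
class ConsecutiveFailureCounter:
    """Count failures in a row; any success resets the count to zero."""

    def __init__(self, limit):
        self.limit = limit
        self.failures = 0

    def record(self, ok):
        """Update with one download result; return True once `limit` failures occur in a row."""
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.limit
```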

Alternatively you could do some kind of moving average. The following code is a simple way to do this:

errorRate = errorRate * 0.8
if (error) {
   errorRate = errorRate + 0.2
}

that gives a progression like this:

Download#   Status  errorRate
     1      ok      0.000
     2      ok      0.000        <=
     3      error   0.200        <= Low rate of errors
     4      ok      0.160        <= 
     5      ok      0.128        
     6      error   0.302        
     7      error   0.442        
     8      ok      0.354
     9      ok      0.283
    10      ok      0.226
    11      error   0.381
    12      error   0.505
    13      error   0.603
    14      ok      0.483
    15      error   0.586
    16      error   0.669        <= High rate of errors shows 
    17      ok      0.535
    18      ok      0.428
    19      ok      0.343
    20      ok      0.274
    21      ok      0.219        <= goes back down after some ok downloads
    etc..

You can play with the factors 0.8 and 0.2 to get a progression you like.
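
The progression above can be reproduced with a short script. This is only a sketch: the function name `update_error_rate` is made up, and tying the added weight to `1 - decay` (so the rate stays between 0 and 1) is an assumption, since the two factors could also be tuned independently.

```python
def update_error_rate(rate, error, decay=0.8):
    """One moving-average step: shrink the old rate, then add weight if this attempt failed."""
    rate *= decay
    if error:
        rate += 1.0 - decay
    return rate


# Statuses of downloads 1 to 16 from the table above.
statuses = ("ok ok error ok ok error error ok ok ok "
            "error error error ok error error").split()

rate = 0.0
for status in statuses:
    rate = update_error_rate(rate, status == "error")

print(round(rate, 3))  # prints 0.669, matching download 16 in the table
```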

The example I gave was not the actual use case. In the real use case, resetting the error counter after each successful event is not sufficient. I do like your moving-average approach. It does, however, require an update of the error rate on every event, not only on errors. It would probably be OK, but I am quite sure there is a solution that requires even fewer calculations. — theintz, Sep 04 '13 at 17:14
Alternatively, add one to a counter each time an error occurs, and then reduce this counter by 20% each minute or similar. — Ebbe M. Pedersen, Sep 04 '13 at 17:56
This already comes close to my initial idea of decreasing the error count proportionally to time elapsed since last error. In fact it might just work like this: `$error_count = $error_count * (1 - ((time() - $last_timestamp) / $time_interval))` — theintz, Sep 04 '13 at 18:55
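
One way to read the "reduce by 20% each minute" suggestion without running a timer is to apply the decay lazily, only when an error arrives, based on the time elapsed since the previous error. The sketch below assumes timestamps in seconds; the class name `DecayingCounter` and the `keep_per_minute` parameter are hypothetical. Note that this is an exponentially decayed score rather than a literal count of events in a window, so the threshold has to be tuned accordingly.

```python
class DecayingCounter:
    """Error counter that loses a fixed fraction of its value per elapsed minute."""

    def __init__(self, keep_per_minute=0.8):
        self.keep_per_minute = keep_per_minute  # 0.8 means "reduce by 20% each minute"
        self.count = 0.0
        self.last = None  # timestamp of the previous error, if any

    def record_error(self, now):
        """Decay the stored count for the elapsed time, add one, and return the new value."""
        if self.last is not None:
            minutes = (now - self.last) / 60.0
            self.count *= self.keep_per_minute ** minutes
        self.last = now
        self.count += 1.0
        return self.count
```

Only errors touch the state, which addresses the concern that the moving average has to be updated on every event.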

score 0 · Accepted Answer · answered Sep 05 '13 at 10:35

Building on the comments from @ebbe-m-pedersen, this is what the solution will look like in PHP using Redis as a data store:

function error_handler($redis) {
    // $redis is passed in, because a variable defined outside is not visible inside the function
    $threshold = 100; // how many errors may occur within the time frame
    $timeframe = 60 * 5; // length of the time frame in seconds (5 minutes)
    $now = time();

    // get error stats from redis (a missing key casts to 0)
    $key_base = 'errors:';
    $count = (int)$redis->get($key_base . 'count'); // calculated count
    $last = (int)$redis->get($key_base . 'last'); // timestamp of the last error

    // calculate damping factor
    $rate = ($now - $last) / $timeframe;
    $rate = min($rate, 1.0); // $rate may not be larger than 1
    $rate = 1 - $rate; // we need the complement for multiplying

    // calculate new error count
    $count = (int)($count * $rate);

    if ($count > $threshold) {
        // take action
    }

    // increase error
    $count++;

    // write error stats back to redis
    $redis->set($key_base . 'count', $count);
    $redis->set($key_base . 'last', $now); // comma, not a dot: key and value are separate arguments
}
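
To experiment with the damping behaviour without a Redis server, the same logic can be sketched in memory. This is an illustrative port, not part of the answer: the class name `LinearDecayCounter`, the injectable `clock`, and the per-event-type dictionary are assumptions. Different time frames for different event types could be handled by creating one instance per configuration.

```python
import time


class LinearDecayCounter:
    """In-memory variant of the linearly damped error count, keyed by event type."""

    def __init__(self, threshold, timeframe, clock=time.time):
        self.threshold = threshold
        self.timeframe = timeframe  # seconds
        self.clock = clock  # injectable so the behaviour can be tested deterministically
        self.state = {}  # event type -> (count, timestamp of last event)

    def record(self, event_type):
        """Register one event; return True if the damped count exceeded the threshold."""
        now = self.clock()
        count, last = self.state.get(event_type, (0, now))
        # Damping factor falls linearly from 1 to 0 as the gap approaches the time frame.
        factor = 1.0 - min((now - last) / self.timeframe, 1.0)
        count = int(count * factor)
        exceeded = count > self.threshold
        self.state[event_type] = (count + 1, now)
        return exceeded
```

With a fake clock, a burst of events trips the threshold, while a gap longer than the time frame resets the count to zero.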
