
I think this is a fairly common question, but I can't seem to find an answer by googling around (maybe there's a more precise name for the problem that I don't know?).

You need to implement a structure with a "hit()" method used to report a hit and hitsInLastSecond|Minute|Hour methods. You have a timer with say nanosecond accuracy. How do you implement this efficiently?

My thought was something like this (in pseudo-C++):

class HitCounter {
  void hit() {
    hits_at[now()] = ++last_count;
  }

  int hitsInLastSecond() {
    auto before_count = hits_at.lower_bound(now() - 1 * second);
    if (before_count == hits_at.end()) { return last_count; }
    return last_count - before_count->second;
  }

  // etc for Minute, Hour

  map<time_point, int> hits_at;
  int last_count = 0;
};

Does this work? Is it good? Is something better?

Update: Added pruning and switched to a deque as per comments:

class HitCounter {
  void hit() {
    hits.push_back(make_pair(now(), ++last_count));
  }

  int hitsInLastSecond() {
    auto before = lower_bound(hits.begin(), hits.end(), make_pair(now() - 1 * second, -1));
    if (before == hits.end()) { return last_count; }
    return last_count - before->second;
  }

  // etc for Minute, Hour

  void prune() {
    auto old = upper_bound(hits.begin(), hits.end(), make_pair(now() - 1 * hour, -1));
    if (old != hits.end()) {
      hits.erase(hits.begin(), old);
    }
  }

  deque<pair<time_point, int>> hits;
  int last_count = 0;
};
alecbz
    If you don't need to keep historic hits, you might want to clean out old data from the map now and again otherwise you will eventually run out of memory. – Neil Kirk Aug 21 '13 at 18:03
    Assuming your "time" is ever-incrementing (and it certainly appears so), a `std::map<>` (and its ensuing expense) should not be necessary, since by definition a simple `std::vector<>` will be sorted by circumstance if you always push the most recent hit on the backside. That said, the idea looks right, and as Neil said, you need to clean out data outside your maximum window, preferably at each push. – WhozCraig Aug 21 '13 at 18:04
  • I thought time_t is only second accuracy, but you say you want nanosecond accuracy. – Neil Kirk Aug 21 '13 at 18:07
  • You probably want a histogram. http://stackoverflow.com/questions/4515874/searching-for-fast-efficient-histogram-algorithm – kfsone Aug 21 '13 at 18:08
  • For the timing part, you probably want std::chrono – kfsone Aug 21 '13 at 18:08
  • How accurate do you want hitsInLastMinute to be? e.g. if it is 12.01.234 then do you just want hits from 12.00 to 12.01 or do you need hits from 12.00.234 to 12.01.234? You can greatly decrease memory requirement if you can quantise the time ranges a bit more. – Peter de Rivaz Aug 21 '13 at 18:10
  • @WhozCraig Ah, that's true. But if we want to add pruning (as per Neil Kirk's comment), that would probably be more efficient with a map? – alecbz Aug 21 '13 at 18:10
  • @PeterdeRivaz I believe the intent of the question was the latter. If it's 12:01, hits in last hour shouldn't return just the hits in the last minute. – alecbz Aug 21 '13 at 18:12
  • @alecbenzer You could prune equally by simply using the vector iterator-ranged erase as well. I only mentioned it because of the nature of the very data you're trying to manage, as you only seem to care about ranges, anyway. Its just a different option (and simpler to understand in my head, anyway). Oh, and +1 for the interesting question. – WhozCraig Aug 21 '13 at 18:16
  • @WhozCraig But range erase on a vector is more expensive than on a map, no? Especially when you're erasing data from not the end and the size of the data you're erasing is small relative to the total amount of data in the vector (which would probably be the case here). – alecbz Aug 21 '13 at 18:20
    @alecbenzer If that is a concern, I'd likely use a `std::deque<>` then, which I would think an even better fit than a `std::vector<>`. It just seems an awful expense to maintain an random-insert-map when all you'll ever do is grow/shrink on the ends. It isn't like you'll be inserting times in the *middle* of other times. (and now that I think about that, this will give the RB-tree rebalance algorithms a real nice workout =P) – WhozCraig Aug 21 '13 at 18:24

1 Answer


What you are describing is called a histogram.

Using a hash keyed at nanosecond accuracy will eat up much of your CPU. You probably want a ring buffer for storing the data.

Use std::chrono to achieve the timing precision you require, but frankly hits per second seems like the highest granularity you need, and if you are looking at the overall big picture, the exact precision doesn't seem like it will matter terribly.

This is a partial, introductory sample of how you might go about it:

#include <array>
#include <algorithm>
#include <cstddef>  // size_t

template<size_t RingSize>
class Histogram
{
    std::array<size_t, RingSize> m_ringBuffer;
    size_t m_total;
    size_t m_position;
public:
    Histogram() : m_total(0), m_position(0)
    {
        std::fill_n(m_ringBuffer.begin(), RingSize, 0);
    }

    void addHit()
    {
        ++m_ringBuffer[m_position];
        ++m_total;
    }

    void incrementPosition()
    {
        if (++m_position >= RingSize)
            m_position = 0;
        m_total -= m_ringBuffer[m_position];
        m_ringBuffer[m_position] = 0;
    }

    double runningAverage() const
    {
        return (double)m_total / (double)RingSize;
    }

    size_t runningTotal() const { return m_total; }
};

Histogram<60> secondsHisto;
Histogram<60> minutesHisto;
Histogram<24> hoursHisto;
Histogram<7> weeksHisto;

This is a naive implementation which assumes you will call incrementPosition every second, and that you will transpose runningTotal from one histogram to the next every RingSize ticks (so every 60 s, add secondsHisto.runningTotal() to minutesHisto).

Hopefully it will be a useful introductory place to start from.

If you want to track a longer histogram of hits per second, you can do that with this model by increasing the ring size and adding a second total that tracks the last N ring-buffer entries, so that m_subTotal = sum(m_ringBuffer[m_position - N .. m_position]), similar to the way m_total works.

size_t m_10sTotal;

...

void addHit()
{
    ++m_ringBuffer[m_position];
    ++m_total;
    ++m_10sTotal;
}

void incrementPosition()
{
    // drop the bucket that falls out of the 10-sample sub-window:
    // the one 9 steps behind the current position, since the
    // current bucket is itself part of the sub-window.
    m_10sTotal -= m_ringBuffer[(m_position + RingSize - 9) % RingSize];
    // for the naive total, do the subtraction after we
    // advance position, since it will coincide with the
    // location of the value RingSize ago.
    if (++m_position >= RingSize)
        m_position = 0;
    m_total -= m_ringBuffer[m_position];
    m_ringBuffer[m_position] = 0;
}

You don't have to make the histograms these sizes; this is simply a naive scraping model. There are various alternatives, such as incrementing each histogram at the same time:

secondsHisto.addHit();
minutesHisto.addHit();
hoursHisto.addHit();
weeksHisto.addHit();

Each rolls over independently, so all have current values. Size each histo as far as you want data at that granularity to go back.

kfsone
    I believe the intent of the problem was that "hitsInLastHour()" will return the number of hits in between now and 1 hour ago, not in between now and the last hour-boundary. A histogram won't work for the former, right? – alecbz Aug 21 '13 at 19:14
Sure it can; the implementation above is relatively naive, in so much as there is a single running total and it is fixed to cheaply sum the current contents of the ring buffer. You could always increment the hits per hour at the same time as the hits per second, instead of scraping it. – kfsone Aug 22 '13 at 04:24
  • @alecbenzer: If you want e.g. last hour totals with latency < 1s, you just need a ring buffer of size >= 60 * 60. If you want last-hour totals with latency < 0.01s, the ring buffer will have to be of size >= 60 * 60 * 100, etc. Still very reasonable space usage, and the nice thing is that the space usage is proportional to the latency (granularity) you want, which is directly under your control: no aspect of the space usage depends on the rate or total number of hits. +1. – j_random_hacker Aug 22 '13 at 10:49
  • @j_random_hacker I think the intent of the problem was to have precision as high as our timer provides. Ie, HitsInLastSecond() should report the hits that occurred up to one second ago, but not include ones that happened one second + one nanosecond ago. Same for hour. So for this we'd need O(nanos_per_hour) space ~ 3.6TB, no? – alecbz Aug 22 '13 at 18:05
    @alecbenzer: So I take it this is a problem you saw and wanted to solve, not one you need a solution to for practical reasons? If you need absolutely zero latency, then yes, I think you would need an enormous array if you went with this approach; but I think for all practical purposes, counting the number of hits in the last hour with 1s latency is very practical, and then this approach is the by far the best :) – j_random_hacker Aug 22 '13 at 21:54