
I was asked this question in an interview. The setup was: assume we are receiving millions of events. Each event has a timestamp and other details. The system design requires the ability to let an end user query the most frequent records in the last 10 minutes, 9 hours, or maybe 3 months.

An event can be seen as follows:

event_type: {CRUD + Search}
event_info: xxx
timestamp : ts...
wayfare

2 Answers


The easiest way to figure this out is to look at how other stream processing or map reduce libraries do it (and I have a feeling your interviewers have seen these libraries). It's basically real-time map reduce (you can look up how that works as well).

I will outline two techniques for event processing. In reality most companies need to do both.

New school Stream processing (real time)

Let's assume for now they don't want the actual events but the more likely case of aggregates (I think that was the intent of your question).

An example stream processing project is pipelinedb (they explain how it works at the bottom of their home page).

  1. Events go into a queue/ring buffer.
  2. A worker process reads those events in batches and rolls them up into partial buckets or windows.
  3. Finally there is a combiner or reducer which takes the micro batches and actually does the updating. An example would be event counts. Because we are using a queue, events come in ordered, and depending on the queue we might be able to have multiple consumers that do the combining operation.

So if you want minute counts you would do rollups per minute and only store the sum of the events for that minute. This turns out to be fairly small space-wise, so you can keep it in memory.

If you wanted those counts for a day, a month, or even a year, you would just add up all the minute count buckets.
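A minimal sketch of that idea in SQL (the table and column names are hypothetical; MySQL-style syntax): the combiner upserts per-minute counts, and any longer window is just a sum over those buckets.

    -- Hypothetical minute-level rollup table; only aggregates are stored, not raw events.
    CREATE TABLE event_counts_minute (
        minute_start DATETIME    NOT NULL,
        event_type   VARCHAR(16) NOT NULL,
        event_count  BIGINT      NOT NULL,
        PRIMARY KEY (minute_start, event_type)
    );

    -- The combiner folds each micro-batch's partial count into its minute bucket.
    INSERT INTO event_counts_minute (minute_start, event_type, event_count)
    VALUES ('2017-05-25 19:34:00', 'SEARCH', 42)
    ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count);

    -- "Most frequent in the last 9 hours" then becomes a sum over minute buckets.
    SELECT event_type, SUM(event_count) AS total
    FROM event_counts_minute
    WHERE minute_start >= NOW() - INTERVAL 9 HOUR
    GROUP BY event_type
    ORDER BY total DESC;

Even a 3-month query only touches on the order of 90 × 1440 ≈ 130,000 minute rows per event type, which is tiny compared to re-scanning millions of raw events.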

Now there is of course a major problem with this technique. You need to know what aggregates and pivots you would like to collect a priori.

But you get extremely fast look up of results.

Old school data warehousing (partitioning) and Map Reduce (batch processed)

Now let's assume they do want the actual events for a certain time period. This is expensive, because if you store all the events in one place, lookup and retrieval are difficult. But if you use the fact that time is hierarchical, you can store the events in a tree of tuples.

The reason you would want the actual events is that you are doing ad hoc querying and are willing to wait for the queries to run.

  1. You need some sort of queue for the stream of events.
  2. A worker reads the queue and partitions the events based on time. For example, you would have a partition for a certain day. This is akin to sharding. Many storage systems have support for this (e.g. Postgres partitions; see the sketch after this list).
  3. When you want a certain set of events over a period, you union the partitions.
  4. The partitioning is essentially hierarchical (minutes < hours < days, etc.), which means you can do tree-like operations on them.
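As a rough sketch of steps 2–4 (assuming Postgres 10+ declarative partitioning; the table and partition names are illustrative):

    -- Illustrative parent table partitioned by day on the event timestamp.
    CREATE TABLE events (
        event_type TEXT NOT NULL,
        event_info TEXT,
        ts         TIMESTAMPTZ NOT NULL
    ) PARTITION BY RANGE (ts);

    -- One child partition per day (these would be created by the worker).
    CREATE TABLE events_2017_05_25 PARTITION OF events
        FOR VALUES FROM ('2017-05-25') TO ('2017-05-26');
    CREATE TABLE events_2017_05_26 PARTITION OF events
        FOR VALUES FROM ('2017-05-26') TO ('2017-05-27');

    -- The planner unions only the partitions overlapping the requested window.
    SELECT event_type, COUNT(*) AS total
    FROM events
    WHERE ts >= '2017-05-25 00:00' AND ts < '2017-05-26 12:00'
    GROUP BY event_type
    ORDER BY total DESC;

The payoff is partition pruning: a 10-minute query only ever touches one or two daily partitions instead of the whole event history.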

There are storage systems specialized for such events, which are called time series data, where the time-based partitioning and indexing is automatic and fast. These are called TSDBs, which you can google for more info.

An example TSDB product would be influxdb.

Now going back to the fact that time (or at least how humans represent it) is organized tree-like, we can perform parallelization operations. This is because a tree is a DAG (directed acyclic graph). With a DAG you can do some analysis and basically recursively operate on the branches (also known as fork/join).

An example generic parallel storage product would be citusdb.

Now of course this method has a massive drawback. It is expensive! Even if you make it fast by increasing the number of nodes, you will have to pay for those nodes (distributed shards). And in theory the performance should scale linearly, but in practice this does not happen (I will save you the details).

Adam Gent

I think you will need to persist the data to disk because:

  • the query duration is super vague, and data might be lost due to some unforeseen circumstances like the process being killed, machine failure, etc.

  • you can't keep all the events in memory due to memory constraints (millions of events)

I would suggest using MySQL as the data store, with the timestamp as one of the index keys. But two events might have the same timestamp, so make a composite index key out of the auto-increment id + timestamp.
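A minimal sketch of such a table (column names and sizes are just illustrative; the composite index leads with the timestamp so the range scans below can use it):

    -- Hypothetical events table with a (timestamp, id) composite index for range queries.
    CREATE TABLE `events` (
        `id`         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        `event_type` VARCHAR(16)     NOT NULL,
        `event_info` TEXT,
        `timestamp`  DATETIME        NOT NULL,
        PRIMARY KEY (`id`),
        KEY `idx_timestamp_id` (`timestamp`, `id`)
    ) ENGINE=InnoDB;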

Advantages of MySQL:

  • Super-reliable with replication
  • Supports all kinds of CRUD operations and queries

On each query you can basically select the range of timestamps you need.

  • First count the no. of events satisfying the query.

    select count(*) from `events` where `timestamp` >= x and `timestamp` <= y;
    
  • If too many events satisfy the query, query them in batches.

     select * from `events` where `timestamp` >= x and `timestamp` <= y limit 1000 offset 0;
     select * from `events` where `timestamp` >= x and `timestamp` <= y limit 1000 offset 1000;
    

    and so on, as long as the offset is less than the count of events from the first query.

Amit Kumar
  • If someone answers "use an SQL database" to a question like this in an interview, I imagine the very next question would be "can you explain how you'd do this from a data structures and algorithms point of view, i.e. without using a database". Using the right tools for the job is important, but interviews tend to be more about being able to solve a problem from scratch instead of just throwing an existing tool at it. – Bernhard Barker May 25 '17 at 19:34
  • @Dukeling Agree on that part; however, he mentioned queries with durations like 3 months or so, and the topic says design a system, which I thought might use external tools. – Amit Kumar May 25 '17 at 20:48