
I have a use case where I have an incoming event stream containing event information about our customers. These events contain information such as the customer's ID, order type, order dollar value, item count in the order, and other data.

I am trying to build a system that indexes this information into daily, weekly, monthly, and lifetime aggregates. Since the data arrives in real time, I'd like the processing to be as close to real time as possible.

One option I was considering was to index this data manually into these aggregation levels as it comes in. The problem with this is that a sudden burst of late-arriving events from the past can cause cascading upstream recalculations at the weekly, monthly, and yearly levels.
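To make the cascade concrete, here is a minimal sketch of that incremental approach (all names are hypothetical, and only two levels are shown): each event updates a daily bucket, and a weekly bucket is maintained from the same stream, so a single late event for a past day touches every coarser level again.

```python
from collections import defaultdict
from datetime import date

# Hypothetical incremental indexer: every event updates the daily
# bucket, and each coarser level (weekly here; monthly and lifetime
# would follow the same pattern) is updated in the same pass.
daily = defaultdict(float)    # (customer_id, day) -> order value total
weekly = defaultdict(float)   # (customer_id, (iso_year, iso_week)) -> total

def apply_event(customer_id, event_day, order_value):
    daily[(customer_id, event_day)] += order_value
    iso_year, iso_week, _ = event_day.isocalendar()
    # A late event for a past day forces this weekly bucket (and the
    # monthly/lifetime ones) to be touched again -- the cascade problem.
    weekly[(customer_id, (iso_year, iso_week))] += order_value

apply_event("c1", date(2023, 1, 3), 100.0)
apply_event("c1", date(2023, 1, 2), 50.0)  # late event from an earlier day
```

After the late event, both days' daily totals and the shared weekly total have to be re-derived, which is exactly what makes bursts of late data expensive at the coarser levels.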

Another way was to simply throw all the data into a DB and run batch jobs to compute the summaries at some regular interval.
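The batch variant sidesteps the cascade because late events are just extra rows that the next recomputation picks up. A rough sketch using an in-memory SQLite table as a stand-in for the real DB (table and column names are made up):

```python
import sqlite3

# Hypothetical batch job: events land in a plain table, and a periodic
# job recomputes the daily summaries from scratch, so late arrivals
# are folded in automatically on the next run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id TEXT, day TEXT, order_value REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("c1", "2023-01-02", 100.0),
    ("c1", "2023-01-02", 50.0),   # a late event: just another row
])
rows = conn.execute(
    "SELECT customer_id, day, SUM(order_value) "
    "FROM events GROUP BY customer_id, day"
).fetchall()
```

The weekly, monthly, and lifetime summaries would be further GROUP BY passes (or roll-ups of the daily output) in the same job.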

My end goal is to precompute snapshots and put them into DynamoDB or MongoDB to vend this info to our app clients.

This problem seems like a cross between data engineering/processing and building a data store to serve a real-time application. If anyone has any suggestions on approaches I'm all ears!

Note: In terms of scale, this system will process millions of events per day.

thebighoncho

1 Answer


In my opinion, you should go for the batch job approach and do the evaluations at regular intervals.

Reason: Because of the late events from the past that you mention, you may be forced to recalculate anything at any moment, which leads to complicated internal logic and has an unpredictable impact on performance.

Also, I would question your real-time requirement. Who is the receiver of the evaluation results, for what purpose, and how often will these results be looked at? It might well be acceptable to provide results that do not yet include the last interval (say, the current day or the current hour).
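One common way to implement that is to have the batch job summarize only fully closed intervals, so the still-changing current day never appears in the precomputed snapshots. A minimal sketch, with hypothetical names:

```python
from datetime import date, timedelta

def last_closed_day(today):
    # Everything strictly before today is treated as final; the current
    # day is still accumulating events and is excluded from snapshots.
    return today - timedelta(days=1)

# Hypothetical per-day event counts, including the in-progress day.
events_per_day = {date(2023, 1, 1): 3, date(2023, 1, 2): 5}
today = date(2023, 1, 2)
snapshot = {d: n for d, n in events_per_day.items()
            if d <= last_closed_day(today)}
```

Late events that arrive for an already-snapshotted day would then simply be included the next time the job reruns over that window.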

Gerd