I have a use case where I have an incoming event stream with information about our customers. Each event contains data such as the customer's ID, order type, order value in dollars, item count, and so on.
I am trying to build a system that indexes this information into daily, weekly, monthly, and lifetime aggregates. Since the data arrives in real time, I'd like the processing to be as close to real time as possible.
One option I was considering is to index this data manually into these aggregation levels as it comes in. The problem is that a sudden burst of late-arriving events from the past can cause cascading recalculations upstream at the weekly, monthly, and lifetime levels.
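To make that first option concrete, here's a minimal sketch of what I mean, assuming events carry an event-time timestamp and all the metrics are additive counters (names are illustrative, and this uses an in-memory dict where the real system would have a store):

```python
from collections import defaultdict
from datetime import datetime

def new_bucket():
    return {"orders": 0, "value": 0.0, "items": 0}

# Stand-in for the aggregate index; each level maps
# (customer_id, period_key) -> running totals.
aggregates = {
    level: defaultdict(new_bucket)
    for level in ("daily", "weekly", "monthly", "lifetime")
}

def period_keys(ts: datetime) -> dict:
    """Bucket an event's own timestamp into one key per level."""
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "daily": ts.strftime("%Y-%m-%d"),
        "weekly": f"{iso_year}-W{iso_week:02d}",
        "monthly": ts.strftime("%Y-%m"),
        "lifetime": "all",
    }

def apply_event(event: dict) -> None:
    # Every event, on time or late, fans out to a write at all four
    # levels -- a burst of late events multiplies that write load,
    # which is the cascade I'm worried about.
    ts = datetime.fromisoformat(event["event_time"])
    for level, key in period_keys(ts).items():
        bucket = aggregates[level][(event["customer_id"], key)]
        bucket["orders"] += 1
        bucket["value"] += event["order_value"]
        bucket["items"] += event["item_count"]
```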
Another option is to simply throw all the raw data into a DB and run batch jobs that compute the summaries on a regular interval.
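A sketch of what that batch pass might look like for the daily level, assuming the raw events land somewhere a scheduled job can scan (pandas and the parquet file are just placeholders for whatever the actual store ends up being):

```python
import pandas as pd

# Hypothetical batch job: recompute the daily rollup from raw events.
# Expected columns: customer_id, event_time, order_value, item_count.
events = pd.read_parquet("events.parquet")
events["event_time"] = pd.to_datetime(events["event_time"])

daily = (
    events
    .assign(day=events["event_time"].dt.date)
    .groupby(["customer_id", "day"])
    .agg(orders=("order_value", "size"),
         value=("order_value", "sum"),
         items=("item_count", "sum"))
    .reset_index()
)
```

Late events are a non-issue here since every run recomputes from the raw data, but the results are only as fresh as the job schedule.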
My end goal is to precompute snapshots and put them into DynamoDB or MongoDB to vend this info to our app clients.
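On the serving side, if I go with DynamoDB, I'm assuming atomic counter updates would let deltas (including late ones) be applied in place. A hypothetical sketch with boto3 (table, key, and attribute names are all made up):

```python
from decimal import Decimal
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("customer_aggregates")  # hypothetical table

def upsert_snapshot(customer_id: str, level: str, period_key: str,
                    orders: int, value: Decimal, items: int) -> None:
    # ADD makes this an atomic increment, so a late event's delta can
    # be applied without reading the item first. Note DynamoDB numbers
    # must go through boto3's resource API as Decimal, not float.
    table.update_item(
        Key={"pk": customer_id, "sk": f"{level}#{period_key}"},
        UpdateExpression="ADD order_count :o, order_value :v, item_count :i",
        ExpressionAttributeValues={":o": orders, ":v": value, ":i": items},
    )

# e.g. upsert_snapshot("cust-123", "daily", "2024-05-01",
#                      1, Decimal("49.99"), 3)
```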
This problem seems like a cross between data engineering/processing and building a data store to serve a real-time application. If anyone has any suggestions on approaches, I'm all ears!
Note: in terms of scale, this system will process millions of events per day.