There is an event processing application that must process events in real or near real time. It is expected to receive 5,000-10,000 messages per minute. Processing each incoming event requires fetching additional data elements.
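To put the load in perspective, here is a quick back-of-the-envelope calculation. It assumes the rate stays constant around the clock and that every event is retained, which is a simplification of mine rather than a stated requirement.

```python
# Convert the stated per-minute load into per-second and per-day figures.
# Assumption: the rate is steady 24 hours a day (real traffic is likely bursty).
low_per_min, high_per_min = 5_000, 10_000

low_per_sec = low_per_min / 60
high_per_sec = high_per_min / 60

minutes_per_day = 24 * 60
low_per_day = low_per_min * minutes_per_day
high_per_day = high_per_min * minutes_per_day

# Size of a 100-day retention window, assuming every event is kept.
low_100_days = low_per_day * 100
high_100_days = high_per_day * 100

print(f"{low_per_sec:.0f}-{high_per_sec:.0f} messages per second")
print(f"{low_per_day:,}-{high_per_day:,} messages per day")
print(f"{low_100_days:,}-{high_100_days:,} messages per 100 days")
```

So the ingest rate itself is modest (on the order of 100-200 messages per second), while a 100-day window could hold around a billion records under these assumptions.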
For the sake of example, let's consider the finance domain. Incoming events are transactions, and processing means validating them against a number of business rules. The additional data elements are varied and include (but are not limited to) account information, client information, and the previous transactions of the same account (important!). Let's say that processing some transactions requires looking back 100 days in the account's history. It is also worth mentioning that the processing is quite complex, and one of the requirements is a powerful query language that supports different patterns for fetching data.
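To make the main access pattern concrete, here is a minimal in-memory sketch of the 100-day lookback: find one account, then slice its transactions by time range. The `TransactionHistory` class, the field names, and the sample data are all hypothetical and only stand in for whatever store is eventually chosen.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=100)  # look-back window from the requirements


class TransactionHistory:
    """Toy stand-in for the real data store (hypothetical, for illustration)."""

    def __init__(self):
        # account_id -> list of (timestamp, amount), kept sorted by timestamp
        self._by_account = defaultdict(list)

    def add(self, account_id, ts, amount):
        # insort keeps each account's list ordered, so range reads stay cheap
        insort(self._by_account[account_id], (ts, amount))

    def recent(self, account_id, now):
        """Return transactions with now - LOOKBACK <= timestamp < now."""
        rows = self._by_account.get(account_id, [])
        # A 1-tuple sorts before any (ts, amount) pair with the same ts,
        # so bisect_left finds the first row at or after each boundary.
        start = bisect_left(rows, (now - LOOKBACK,))
        end = bisect_left(rows, (now,))
        return rows[start:end]


history = TransactionHistory()
now = datetime(2024, 6, 1)
history.add("acc-1", now - timedelta(days=150), 500.0)  # outside the window
history.add("acc-1", now - timedelta(days=30), 200.0)
history.add("acc-1", now - timedelta(days=1), 50.0)

recent = history.recent("acc-1", now)
total = sum(amount for _, amount in recent)
```

The point of the sketch is that the dominant read is "one account, one time range", so whichever product is chosen should be able to serve that lookup quickly for every incoming event, in addition to supporting the richer query patterns mentioned above.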
The question is which solution/product to choose for storing and fetching the data needed to process such events.
Let's assume that the volume of data is high, so a relational database is not an option at all. The solution should therefore be easy to scale out.
What I have in mind currently:
- HDFS + Spark
- HDFS/HBase + Spark
- Cassandra + Spark
Any thoughts on this?