
There is an event processing application that must process events in real time or near real time. It is expected to receive 5,000-10,000 messages a minute. Processing an incoming event requires fetching additional data elements.

For the sake of example, consider the finance domain. Incoming events are transactions, and processing consists of validating them against a number of business rules. The additional data elements are varied and include (but are not limited to) account information, client information, and the previous transactions of the particular account (important!). Say that to process some transactions we need to look back 100 days in history. It is also worth mentioning that the processing is quite complex, and one of the requirements is a powerful query language to support different data-access patterns.
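To make the 100-day lookback concrete: a common pattern in wide-column stores like Cassandra or HBase is to bucket transactions by account and day, so a history query touches a bounded, predictable set of partitions. A minimal sketch of the key computation (the bucketing scheme and key format are illustrative assumptions, not something from the question):

```python
from datetime import date, timedelta

def history_buckets(account_id, as_of, lookback_days=100):
    """Return the per-day partition keys to query for an account's
    recent history, assuming rows are partitioned by (account, day)."""
    return [
        f"{account_id}:{(as_of - timedelta(days=d)).isoformat()}"
        for d in range(lookback_days)
    ]

keys = history_buckets("acct-42", date(2016, 5, 1), lookback_days=3)
# keys covers 2016-05-01 back through 2016-04-29
```

With a scheme like this, the 100-day requirement becomes at most 100 point lookups per event, which can be issued in parallel.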

The question is which solution/product to choose for storing and fetching the data needed to process such events.

Let's assume the volume of data is high, so a relational database is not an option at all. The solution should therefore scale out easily.

What is in my mind currently:

  1. HDFS + Spark
  2. HDFS/HBase + Spark
  3. Cassandra + Spark

Any thoughts on this?

Stephen L.

1 Answer


100-200 events a second is not a huge scale, but you didn't mention data sizes or other issues, such as the probability of getting several events that need the same (or at least common) data, how well the data can be sharded, etc.

These questions greatly affect which solutions are relevant. That said, both HBase and Cassandra can be made to fetch data quickly enough for your purposes. Spark and HDFS would only fit if you can load all the needed data into memory (in which case you probably don't need HDFS anyway).
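To illustrate the in-memory point: if the working set fits in RAM, even a plain per-account index gives fast lookups without any external store. A toy sketch of that idea (class and method names are illustrative, not from any product):

```python
from collections import defaultdict
from datetime import date, timedelta

class AccountHistory:
    """Toy in-memory store: transactions indexed by account,
    filtered to a lookback window at query time."""

    def __init__(self, lookback_days=100):
        self.lookback = timedelta(days=lookback_days)
        self._by_account = defaultdict(list)

    def add(self, account_id, txn_date, amount):
        self._by_account[account_id].append((txn_date, amount))

    def recent(self, account_id, as_of):
        """All of the account's transactions within the lookback window."""
        cutoff = as_of - self.lookback
        return [t for t in self._by_account[account_id]
                if cutoff <= t[0] <= as_of]

store = AccountHistory(lookback_days=100)
store.add("acct-1", date(2016, 1, 1), 50.0)
store.add("acct-1", date(2016, 4, 20), 75.0)
hits = store.recent("acct-1", date(2016, 5, 1))
# only the April transaction falls inside the 100-day window
```

A real data grid (Ignite, Geode) adds replication, partitioning across nodes, and querying on top of essentially this model.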

If you can fit all or most of the relevant data into memory, you may want to look at in-memory data grids like Apache Ignite or Apache Geode.

Arnon Rotem-Gal-Oz