
In stream processing applications (e.g. based on Apache Flink or Apache Spark Streaming) it is sometimes necessary to process data exactly once.

In the database world something similar can be achieved by using databases that follow the ACID criteria (correct me if I'm wrong here).

However, there are a lot of (non-relational) databases that follow BASE rather than ACID.

Now my question is: if I integrate such a BASE database into a stream processing application (exactly once), can I still guarantee exactly-once processing for the whole pipeline? And if so, under what circumstances?

MW.

1 Answer


Exactly-once semantics means that the processing framework (such as Flink) can guarantee each incoming record (event) will be processed exactly one time, even if the pipeline fails in any way.

This is done by taking checkpoints along the pipeline, so that when the application recovers from a failure, operations that already completed successfully are not executed again.
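The checkpoint-and-replay idea can be sketched in a few lines of plain Python (this is an illustration only, not the Flink API; the snapshot-per-record granularity is simplified, as real systems checkpoint periodically):

```python
# Toy checkpoint-and-replay loop: state and source offset are
# snapshotted together, so after a crash, processing resumes from
# the last checkpoint and no checkpointed record is counted twice.
events = [1, 2, 3, 4, 5]

state = {"sum": 0}
checkpoint = {"offset": 0, "state": {"sum": 0}}

def process(start_offset, fail_at=None):
    """Process events from start_offset; optionally crash at fail_at."""
    global checkpoint
    state.update(checkpoint["state"])  # restore the snapshotted state
    offset = start_offset
    while offset < len(events):
        if offset == fail_at:
            raise RuntimeError("simulated crash")
        state["sum"] += events[offset]
        offset += 1
        # snapshot after every record (real systems do this periodically)
        checkpoint = {"offset": offset, "state": dict(state)}

try:
    process(0, fail_at=3)          # crash before processing events[3]
except RuntimeError:
    pass

process(checkpoint["offset"])      # recover: resume after last checkpoint

print(state["sum"])                # each event counted exactly once -> 15
```

Despite the simulated crash, the final sum is 15, as if every event had been processed exactly once.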

It depends on what kind of operations you are doing with the database. In most cases, databases are used as sinks that processing results are written into. In that case the database operation is just a simple insert; it will not be executed again after one successful run, so the pipeline is still exactly-once regardless of the database's ACID support.
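One way to make such a sink write safe even if it *is* replayed is to key it by a unique record ID, so a repeated write leaves the sink unchanged. A minimal sketch using SQLite as a stand-in sink (the table and column names here are invented for the example):

```python
import sqlite3

# SQLite stands in for any sink database; the schema is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (record_id TEXT PRIMARY KEY, value INTEGER)")

def write_result(record_id, value):
    # Keyed upsert: replaying the same record after a failure
    # overwrites the row with identical data instead of duplicating it.
    conn.execute(
        "INSERT OR REPLACE INTO results (record_id, value) VALUES (?, ?)",
        (record_id, value),
    )
    conn.commit()

write_result("evt-42", 7)
write_result("evt-42", 7)   # replay after recovery: no duplicate row

rows = conn.execute("SELECT * FROM results").fetchall()
print(rows)                 # [('evt-42', 7)] -- one row despite two writes
```

Most key-value and document stores offer an equivalent keyed put/upsert, which is what makes this pattern portable to BASE databases.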

You might be tempted to group operations together into transactions for databases that support ACID, but that is bad practice in a parallel streaming pipeline: the parallel instances create multiple transactions, and the locks might block the whole process. Instead, a BASE (NoSQL) database that is fast under intensive read and update load is preferable. You just need to make your operations idempotent, so that partially re-executed statements (if the pipeline fails halfway through, the statements may all be executed again after recovery) won't result in incorrect data.
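The idempotency requirement comes down to the difference between setting a value by key and mutating a value in place. A small sketch (a plain dict stands in for the BASE store; the field names are invented):

```python
# A plain dict stands in for a key-value (BASE) store.
store = {"count": 0, "status": None}

def apply_increment(record):
    # NOT idempotent: replaying the record after a failure double-counts.
    store["count"] += record["delta"]

def apply_set(record):
    # Idempotent: writing the same value for the same key twice is a no-op.
    store["status"] = record["status"]

record = {"delta": 1, "status": "done"}

# Simulate a failure after the writes, followed by a full replay:
for _ in range(2):
    apply_increment(record)
    apply_set(record)

print(store["count"])   # 2 -- the increment was double-counted
print(store["status"])  # 'done' -- the set stayed correct under replay
```

In practice this means storing absolute values keyed by record or window ID rather than issuing relative updates such as increments.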

eric tan
  • Thanks for your reply. I can follow your arguments for simple use cases, but I'm stuck thinking about more complex scenarios. What about a use case in which a sink or another operator writes (intermediate) results into a DB while other operators (earlier or later in the pipeline) read this data? Is something like this manageable, e.g. by saving window IDs in the DB or implementing something else to keep things in sync? – MW. Apr 26 '21 at 12:25
    If you are asking whether it's possible to keep intermediate data isolated from another operator, the answer is no, since NoSQL databases don't do "transactions" (these are blocking operations, which is not what they are designed for). Some NoSQL databases such as Cassandra still have batch statement support for atomic updates, but it's not performant. In general, operators update data with the same key so as to avoid conflicts. In more complex cases you will just have to carefully model your data structure and plan your operations to keep them in sync. – eric tan Apr 26 '21 at 14:28