
I just watched "Turning the Database Inside-Out" and noticed a similarity between Samza and Redux: in both, all state is derived from a stream of immutable objects.

This made me realize that if you edited the stream after the fact, you could in theory regenerate every materialized view from the new list of transactions and, in effect, "undo" a past change to the database.

As an example, suppose I had the following series of diffs:

1. Add user "tom"
2. Add user "bob"
3. Delete user "bob"
4. Change user "tom"'s name to "joe"
5. Add user "fred"

After this series of changes, our database looks like:

+-------+
| users |
+-------+
| joe   |
| fred  |
+-------+
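
To make this concrete, here is a minimal sketch in TypeScript, in the Redux style. The event shapes, the reducer, and the log are my own invention for illustration; none of this is an API from Samza, Redux, or any other system mentioned here:

    // A minimal event-sourcing sketch: the "database" is never mutated
    // directly; it is derived by folding a log of immutable events
    // through a reducer, Redux-style.
    type Change =
      | { type: "ADD_USER"; name: string }
      | { type: "DELETE_USER"; name: string }
      | { type: "RENAME_USER"; from: string; to: string };

    type Users = string[]; // the materialized "users" table

    function reducer(users: Users, change: Change): Users {
      switch (change.type) {
        case "ADD_USER":
          return [...users, change.name];
        case "DELETE_USER":
          return users.filter((u) => u !== change.name);
        case "RENAME_USER":
          return users.map((u) => (u === change.from ? change.to : u));
      }
    }

    const log: Change[] = [
      { type: "ADD_USER", name: "tom" },                // 1
      { type: "ADD_USER", name: "bob" },                // 2
      { type: "DELETE_USER", name: "bob" },             // 3
      { type: "RENAME_USER", from: "tom", to: "joe" },  // 4
      { type: "ADD_USER", name: "fred" },               // 5
    ];

    // Replaying the full log materializes the table above.
    console.log(log.reduce(reducer, [] as Users)); // [ 'joe', 'fred' ]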

Now what if I wanted to undo number "3"? Our new set of diffs would be:

1. Add user "tom"
2. Add user "bob"
4. Change user "tom"'s name to "joe"
5. Add user "fred"

And our database:

+-------+
| users |
+-------+
| joe   |
| bob   |
| fred  |
+-------+
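
Continuing the sketch above, "undoing" change 3 is just replaying a filtered copy of the log. This glosses over the real cost: every materialized view has to be recomputed from scratch, which is exactly the part a stream processor would need to support:

    // "Undo" change 3 by dropping it from the log and replaying.
    const edited = log.filter((_, i) => i !== 2); // index 2 holds change 3
    console.log(edited.reduce(reducer, [] as Users)); // [ 'joe', 'bob', 'fred' ]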

While this sounds good in theory, can it actually be done using Samza, Storm, or Spark? Can any transaction-stream database do this?

I'm interested in this functionality for administrative purposes. I have some sites where clients have accidentally deleted an employee or modified records they didn't mean to. In the past I solved this by creating a separate table that recorded every change to the database; when an issue arose I could (manually) inspect that table, figure out what they did wrong, and (manually) fix the data.

It would be SO much cooler if I could just look at the transaction stream, remove the bad entry, and say "regenerate the database".

stevendesu
  • I am not familiar with the details of all three systems. For Spark and Storm there is no system support for this, AFAIK. I have my doubts that you can do it in Spark at all (what Spark offers is not really streaming). For Storm, maybe -- however, where do you want to populate the database? No clue about Samza. I am also wondering how you want to "delete" a record from the transaction stream. I understand your basic idea, but to give a good answer some more details would be required. For example, if you use Kafka to store your stream, deleting a record is not possible... – Matthias J. Sax Aug 03 '16 at 19:31
  • My assumption is that the event stream would be written to disk (these are immutable facts and never need to be updated or rewritten) and that you'd keep in-memory materialized views of the aggregated end results of the streams. If an event were removed or modified, the materialized views would need to be recomputed, which could be pricey -- but you shouldn't be rewriting history that often. It'd be like git for databases. I've considered building a system to do this entirely within MySQL, since I can't find any existing software that does it. – stevendesu Aug 06 '16 at 13:45
  • What I still don't understand: you claim that the data consists of immutable facts, but later you say you want to remove or rewrite a fact -- if it's immutable, you cannot do that... So I guess your data is not immutable after all. And if it is not, why not just use a database? – Matthias J. Sax Aug 06 '16 at 13:51
  • Anyway, if your data stream is a changelog and you want to remove an item from it and rebuild the result table, you can use Spark, Storm, Samoa, or Flink: just reread all the data and build the new table. Of course, you would need to trigger this job manually. – Matthias J. Sax Aug 06 '16 at 13:55

0 Answers