
I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service that uses a Twitter client to subscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.

What I don't understand is this: if I model my aggregate root as a List of TwitterMessages, for example, then after running for some time this aggregate root will be several gigabytes in size. There is no need to keep all the TwitterMessages in memory, since the goal of this one service is just to persist each incoming message and then publish it out to Kafka for the next service to process.

How would I model my aggregate root as a PersistentEntity for a stream of messages without it consuming unlimited resources? Is there any example code showing this usage of Lagom?

user3139545
  • What are the business rules regarding that Aggregate? What are the invariants that it protects? – Constantin Galbenu Oct 02 '17 at 17:06
  • Nothing, it should just append each incoming message to the database. – user3139545 Oct 02 '17 at 17:17
  • then you don't need an Aggregate, at least not an Event-sourced one. Maybe some kind of stream processor? – Constantin Galbenu Oct 02 '17 at 17:19
  • My thought was that with Lagom you use event sourcing, and in my case I would have some command triggered every time an incoming message arrives. That command would trigger an event which contains the incoming message, which is then persisted. If I used a stream processor I would just use a regular Cassandra client and write each incoming message in a forEach type of fashion on the incoming stream. Should I not use event sourcing for all my persistence needs? – user3139545 Oct 02 '17 at 17:26
  • I don't see the point as you don't have any invariants to protect. – Constantin Galbenu Oct 02 '17 at 17:32

1 Answer


Event sourcing is a good default to reach for, but it's not the right solution for everything, and in your case it may not be the right approach. Firstly, do you need the Tweets persisted at all, or is it OK to publish them directly to Kafka?

Assuming you need them persisted: an aggregate should hold in memory whatever it needs to validate incoming commands and generate new events, and nothing more. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages; its state could simply be NotUsed. Each time it receives a command, it emits a new event for that Tweet.

The thing here is that it's not really an aggregate, because you're not aggregating any state; you're just emitting events in response to commands, with no invariants to protect. So you're not really using the Lagom persistent entity API for what it was made for. Nevertheless, it may make sense to use it this way anyway: it's a high-level API that comes with a few useful things, including the event stream you can use to publish the messages on to Kafka.

But there are also some gotchas you should be aware of. If you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially. So you could maybe expect to handle 20 tweets a second; if you expect it to ever be more than that, you're using the wrong approach, and at a minimum you'll need to distribute your tweets across multiple entities, as in the sketch below.
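As a rough illustration, a minimal sketch of such an entity with the Lagom Scala API might look like the following. The RecordTweet command, TweetRecorded event, and TweetEntity names are made up for this example; a real service would also need to register JSON serializers, and to publish the events to Kafka with a TopicProducer you'd additionally tag them with an AggregateEventTag, omitted here:

```scala
import akka.Done
import akka.NotUsed
import com.lightbend.lagom.scaladsl.persistence.PersistentEntity

// Hypothetical command and event for this example
final case class RecordTweet(body: String) extends PersistentEntity.ReplyType[Done]
final case class TweetRecorded(body: String)

final class TweetEntity extends PersistentEntity {
  override type Command = RecordTweet
  override type Event   = TweetRecorded
  override type State   = NotUsed

  // There is no state to aggregate, so it is always NotUsed
  override def initialState: NotUsed = NotUsed

  override def behavior: Behavior = {
    case _ =>
      Actions()
        .onCommand[RecordTweet, Done] {
          // Persist one event per incoming Tweet, then acknowledge
          case (RecordTweet(body), ctx, _) =>
            ctx.thenPersist(TweetRecorded(body))(_ => ctx.reply(Done))
        }
        .onEvent {
          // Events never change the (empty) state
          case (_, state) => state
        }
  }
}
```

To spread the load across multiple entities rather than funnelling everything through one, you could use something like the user id as the entity id when you look the entity up, e.g. `persistentEntityRegistry.refFor[TweetEntity](tweet.userId).ask(RecordTweet(tweet.body))`, so Tweets from different users are handled by different entity instances.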

The other approach would be to simply store the messages directly in Cassandra yourself, and then publish them to Kafka after the write succeeds. This would be a lot simpler, with far less machinery involved, and it should scale very nicely; just make sure you choose your partition key columns in Cassandra wisely. I'd probably partition by user id.
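Here's a minimal sketch of that approach, assuming an Akka Streams Source of Tweets from your Twitter client, Lagom's CassandraSession for the writes, and Alpakka Kafka's Producer.plainSink for the publishing. The Tweet case class, table name, topic name, and bootstrap server are all placeholders:

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source
import com.lightbend.lagom.scaladsl.persistence.cassandra.CassandraSession
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

import scala.concurrent.Future

// Placeholder message type for this example
final case class Tweet(userId: String, tweetId: String, body: String)

class TweetPipeline(tweets: Source[Tweet, _], session: CassandraSession)(
    implicit system: ActorSystem) {

  import system.dispatcher
  private implicit val mat: ActorMaterializer = ActorMaterializer()

  private val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Assumes a table partitioned by user id, as suggested above:
  //   CREATE TABLE tweets (user_id text, tweet_id text, body text,
  //                        PRIMARY KEY ((user_id), tweet_id))
  def run(): Future[Done] =
    tweets
      .mapAsync(parallelism = 8) { tweet =>
        // Write to Cassandra first, so a Tweet is only published to
        // Kafka after it has been durably stored
        session.executeWrite(
          "INSERT INTO tweets (user_id, tweet_id, body) VALUES (?, ?, ?)",
          tweet.userId, tweet.tweetId, tweet.body
        ).map(_ => tweet)
      }
      .map(t => new ProducerRecord[String, String]("tweets", t.userId, t.body))
      .runWith(Producer.plainSink(producerSettings))
}
```

Writing to Cassandra before publishing means a consumer will never see a Tweet on the topic that isn't also in the database; the trade-off is that a crash between the two steps can drop the Kafka publish, which is one of the things Lagom's persistent entity and TopicProducer machinery handles for you in the first approach.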

James Roper