I'm working on an app that requires clients to subscribe to some rows of a "Data" DynamoDB table. Clients should receive an initial snapshot, and streaming updates through a WebSocket connection.
What is the most efficient way to do so? Or, more precisely...
My current plan is to
1. Listen to the "Data" table's change stream with a Lambda
2. Have this Lambda forward each event to an SQS FIFO queue. Another Lambda processes this queue by...
3. Querying interested WebSocket subscribers from a DynamoDB "Subscribers" table (similar to this example)
4. Pushing the update out to those WebSocket connections
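As a rough sketch of what steps 3 and 4 might look like: the table name, the `RowKey` and `ConnectionId` attribute names, and the `fan_out` helper are all hypothetical. The two clients are passed in, and are assumed to be a boto3 `dynamodb` client and a boto3 `apigatewaymanagementapi` client.

```python
import json


def fan_out(update, row_key, dynamodb, apigw,
            table_name="Subscribers", consistent_read=True):
    """Push one update to every connection subscribed to row_key.

    dynamodb: assumed to be boto3.client("dynamodb")
    apigw:    assumed to be boto3.client("apigatewaymanagementapi", ...)
    Returns the connection IDs the update was sent to.
    """
    # Step 3: look up subscribers for this row. "RowKey" and "ConnectionId"
    # are hypothetical attribute names; pagination is omitted for brevity.
    # Whether consistent_read really has to be True is the open question.
    response = dynamodb.query(
        TableName=table_name,
        KeyConditionExpression="RowKey = :rk",
        ExpressionAttributeValues={":rk": {"S": row_key}},
        ConsistentRead=consistent_read,
    )
    payload = json.dumps(update).encode("utf-8")

    # Step 4: push the update to each WebSocket connection.
    # Handling of closed connections is left out of this sketch.
    sent = []
    for item in response["Items"]:
        connection_id = item["ConnectionId"]["S"]
        apigw.post_to_connection(ConnectionId=connection_id, Data=payload)
        sent.append(connection_id)
    return sent
```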
When a subscriber comes in, I plan to
5. Add their WebSocket connection ID to the "Subscribers" table (so they should receive delta updates from that point on), and then afterwards
6. Query a current snapshot of the data, and push that out to the subscriber
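A minimal sketch of that ordering (register first, snapshot second). The `handle_subscribe` and `send_snapshot` names and the key layout are assumptions for illustration; `dynamodb` is assumed to be a boto3 `dynamodb` client.

```python
def handle_subscribe(connection_id, row_key, dynamodb, send_snapshot,
                     subscribers_table="Subscribers", data_table="Data"):
    """Register a subscriber, then send them the current snapshot.

    send_snapshot: hypothetical callable (connection_id, item) that pushes
                   the snapshot over the WebSocket connection.
    """
    # Step 5: register first, so deltas start flowing to this connection.
    # "RowKey" and "ConnectionId" are hypothetical attribute names.
    dynamodb.put_item(
        TableName=subscribers_table,
        Item={"RowKey": {"S": row_key}, "ConnectionId": {"S": connection_id}},
    )

    # Step 6: only afterwards read the current state and push it out.
    response = dynamodb.get_item(
        TableName=data_table,
        Key={"RowKey": {"S": row_key}},
    )
    snapshot = response.get("Item")  # None if the row does not exist
    send_snapshot(connection_id, snapshot)
    return snapshot
```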
Now of course a client might thereby receive a delta update before it receives the snapshot, but that's not an issue in my case (data is versioned and those conflicts can be managed by the client).
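For illustration, the client-side handling could be as simple as "highest version wins". This sketch assumes every message (snapshot or update) carries the full row value plus an integer version; the message shape and the `apply_message` name are made up for the example.

```python
def apply_message(state, message):
    """Apply a snapshot or update, keeping whichever version is newest.

    state:   dict mapping row key -> {"version": int, "value": ...}
    message: {"key": ..., "version": int, "value": ...} (hypothetical shape)
    Returns True if the message was applied, False if it was stale.
    """
    current = state.get(message["key"])
    if current is not None and current["version"] >= message["version"]:
        # An update overtook the snapshot (or arrived twice): ignore it.
        return False
    state[message["key"]] = {
        "version": message["version"],
        "value": message["value"],
    }
    return True
```

With this, an update at version 2 followed by a late snapshot at version 1 leaves the client on version 2.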
My concern is that by default, step 3 (querying current subscribers) would need to be a strongly consistent read; otherwise a subscriber might miss an update. For example: we send out an initial snapshot, then an update comes in, but due to eventual consistency step 3 doesn't see the new subscriber yet, so they miss out!
That kind of sucks, because we likely need to query subscribers quite often (every time an update occurs), and having to do consistent reads will slow things down - and make them more expensive from a billing perspective!
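To put a number on the billing side: DynamoDB meters reads in 4 KB units, and an eventually consistent read consumes half the read capacity of a strongly consistent one. A back-of-the-envelope estimate per query, assuming the total size of the returned items is known (the function name is made up):

```python
import math


def read_capacity_units(total_bytes, strongly_consistent):
    """Estimate the read capacity consumed by one Query."""
    # The summed item size is rounded up to the next 4 KB boundary,
    # with a minimum of one block even for an empty result.
    blocks = max(1, math.ceil(total_bytes / 4096))
    # Eventually consistent reads cost half as much per block.
    return blocks * (1.0 if strongly_consistent else 0.5)


# Example: a query returning ~10 KB of subscriber items
strong = read_capacity_units(10_000, strongly_consistent=True)   # 3.0
weak = read_capacity_units(10_000, strongly_consistent=False)    # 1.5
```

So every subscriber lookup costs twice as much when it has to be strongly consistent.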
Are there any options to improve this?
Ideally, I'd like to insert a step after step 5 (and before step 6) that is "wait until the data has been pushed out to all replicas, so all weak reads after this will pick up the new subscriber". But I don't think that is possible - please do correct me if I'm wrong.
Otherwise, I'm considering adding a timestamp to the Subscribers table. Step 3 could then issue two separate queries: a weakly consistent read for Subscribers where Timestamp <= now - 10 minutes, and a strongly consistent read for Subscribers where Timestamp > now - 10 minutes. That relies on assuming that a subscriber who came in more than 10 minutes ago has propagated to all nodes by now, so every weakly consistent read "should" know about them. Needless to say, this feels VERY dodgy!
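Sketched out, the two-query idea might look like the following. It assumes a key layout where the subscription timestamp (a hypothetical `SubscribedAt` attribute, epoch seconds) is the sort key, so that the time condition can go into the key condition rather than a filter; `dynamodb` is assumed to be a boto3 `dynamodb` client, and pagination is omitted.

```python
import time

RECENT_WINDOW_SECONDS = 10 * 60  # the "10 minutes" assumption from above


def find_subscribers(row_key, dynamodb, now=None, table_name="Subscribers"):
    """Return connection IDs using a weak read for old subscribers and a
    strongly consistent read for recent ones."""
    now = time.time() if now is None else now
    cutoff = str(int(now - RECENT_WINDOW_SECONDS))

    def query(comparison, consistent):
        # "RowKey", "SubscribedAt" and "ConnectionId" are hypothetical names.
        response = dynamodb.query(
            TableName=table_name,
            KeyConditionExpression=(
                "RowKey = :rk AND SubscribedAt " + comparison + " :cutoff"
            ),
            ExpressionAttributeValues={
                ":rk": {"S": row_key},
                ":cutoff": {"N": cutoff},
            },
            ConsistentRead=consistent,
        )
        return response["Items"]

    older = query("<=", consistent=False)   # cheap, eventually consistent
    recent = query(">", consistent=True)    # small, strongly consistent
    # A set removes duplicates in case both queries return a connection.
    return {item["ConnectionId"]["S"] for item in older + recent}
```

The code itself is simple; the dodgy part is still the 10-minute constant, which nothing in this sketch verifies.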
I'd be keen to hear better ideas, or thoughts on how bad my dodgy idea really is.