How to share data between microservices without sync RPC (use topics as changelogs) and deal with consistency?

Question

I learned about using Kafka topics as a changelog to avoid synchronous RPC, but I don't understand how we deal with consistency, since topics do not keep records forever (they have a retention policy).

For example, I run an application with 2 microservices:

The User Service is used to update users' data in the system (address, first name, phone number...).
The Shipping Service, uses Users' data to create a shipping order and send it to the shipping company's system.

Each service has its own DB to persist the users' data. To communicate any change made to a user's data, the Confluent instructor proposed creating a topic and using it as a changelog: the User Service publishes the changes, and other microservices consume them.

But what if:

User X changed his address 1 year ago
the retention policy of the changelog is 6 months
today we add BillingService to the system.

The BillingService won't know User X's address, so its view is inconsistent. Should I run a one-time "call UserService to copy its full DB" step whenever a new service enters the system? That seems like an ugly solution.

More tricky and challenging:

We have a changelog with a retention policy of time T
A consumer service is down for longer than T

Therefore, it will potentially miss some changelog records. How do we deal with that? We can never be confident that the service knows everything it has to know about the users.

I did some research but found nothing. I think I don't have enough vocabulary yet to search well, as the problem sounds pretty common. Sorry if there is a source dedicated to this problem that I did not find!

score 1 · Answer 1 · answered Mar 26 '22 at 14:26

If the changelog topic is covering entities that are of unbounded lifetime (like your users, hopefully), that strongly suggests that the retention period for that topic should be infinite. Chances are that topic is sufficiently low volume that infinite retention is viable (consider that it can probably be partitioned).

If for some reason that's not viable, you can arrange for producers to republish "this is the state of this entity" to the topic for every entity they own, at an interval shorter than the retention period. For entities which don't change very much, this is pretty wasteful and duplicative (but for those, a very long to infinite retention period is more viable). For entities which do change a lot, this is a rounding error in terms of volume.
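A minimal sketch of the periodic republishing idea, using an in-memory stand-in for a topic with time-based retention rather than a real Kafka client. The names `Record`, `Changelog`, `republish_all` and `build_view` are hypothetical, and time is counted in whole days to keep the arithmetic obvious.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    timestamp: int  # day on which the record was published
    key: str        # entity id, e.g. a user id
    value: dict     # full state of the entity at that time


class Changelog:
    """In-memory stand-in for a topic with time-based retention (assumption)."""

    def __init__(self, retention_days):
        self.retention_days = retention_days
        self.records = []

    def publish(self, record):
        self.records.append(record)

    def read_all(self, now):
        # Records older than the retention period are no longer readable.
        cutoff = now - self.retention_days
        return [r for r in self.records if r.timestamp >= cutoff]


def republish_all(owner_db, log, now):
    """Producer side: publish the current state of every owned entity."""
    for key, value in owner_db.items():
        log.publish(Record(now, key, dict(value)))


def build_view(log, now):
    """Consumer side: a brand-new service replays whatever is still retained."""
    view = {}
    for record in log.read_all(now):
        view[record.key] = record.value  # later records overwrite earlier ones
    return view
```

With a 180-day retention and a republish every 90 days, a service added on day 365 still sees a user whose last real change happened on day 0, because at least one republished copy of that user's state is always inside the retention window.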

That neatly solves the first case and eventually allows for the second to be solved. For the second, there is basically no solution, which means that you have to choose the retention period for a topic such that you can guarantee that no consumer of this topic will ever be down (or not deployed) for longer than the retention period: this typically means that a retention period shorter than, say, 7 days, should be really heavily scrutinized. Note that if you have a 1 week retention period and a consumer has been down for more than a few days, you can temporarily bump up the retention period to buy you time for the consumer to get fixed, and if there's a consumer which can be down for more than a week without anybody noticing, how important is that consumer, really?
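A consumer can at least detect that it was down for too long, instead of silently continuing with gaps. The sketch below assumes the consumer knows its own committed offset (the next offset it would read) and the earliest offset the topic still retains for that partition. Both function names are hypothetical, and how the two offsets are obtained depends on the client library.

```python
def missed_records(committed_offset, earliest_retained_offset):
    """Number of records that retention deleted before the consumer read them.

    committed_offset: next offset the consumer intends to read.
    earliest_retained_offset: oldest offset still present in the partition.
    """
    return max(0, earliest_retained_offset - committed_offset)


def needs_rebootstrap(committed_offset, earliest_retained_offset):
    """True if the consumer can no longer catch up from the topic alone."""
    return missed_records(committed_offset, earliest_retained_offset) > 0
```

If `needs_rebootstrap` returns true, the consumer's local view can no longer be trusted and it has to be rebuilt from some other source of full state.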

Thank you Levi, your answer is helpful to think about the problem. As you say, forcing the producer to publish a non changing entity's state would be wasteful, it's a bad design I think. — Julien Elkaim, Mar 27 '22 at 03:21
Also, applied to one of my use cases, the data shared is very big, unfortunately (It is "TaskExecuted" events, so it includes the task results of data processing from multiple file sources. Could be MBs, occasionally GBs). 3 microservices need this information. Maybe that's where "decoupling" microservices reaches the limit? And using Kafka topics for such big events is strange I would say. Maybe a shared DB to store the MBs, using in parallel a kafka topic to publish events with a reference to the execution result's MBs in the database is a better solution? — Julien Elkaim, Mar 27 '22 at 03:31
Or using the sync RPC that I wanted to avoid in the first place is a choice that makes sense... — Julien Elkaim, Mar 27 '22 at 03:32
Yeah, Kafka does not like messages which are much bigger than tens of kilobytes in my experience, and definitely not messages that are bigger than a megabyte (having to raise the max message size for a topic is a sign that you might be heading down a painful path with Kafka). I would write them to an object store (e.g. S3 (or your cloud provider's equivalent) or minio if on-prem) and publish a path to the store. Those paths should be small enough that you can have very long to infinite retention. — Levi Ramsey, Mar 27 '22 at 12:59
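The "write to an object store and publish a path" suggestion could look roughly like this. It is only a sketch: `ObjectStore` is a dict-backed stand-in for S3 or minio, the topic is a plain list, and the event fields (`result_path`, `size_bytes`) and the path layout are made up for illustration.

```python
import hashlib
import json


class ObjectStore:
    """Dict-backed stand-in for an object store such as S3 or minio (assumption)."""

    def __init__(self):
        self._blobs = {}

    def put(self, path, data):
        self._blobs[path] = data

    def get(self, path):
        return self._blobs[path]


def publish_task_executed(store, topic, task_id, result):
    """Store the large result, then publish a small event that points to it."""
    digest = hashlib.sha256(result).hexdigest()
    path = f"task-results/{task_id}/{digest}"
    # Write the payload first so the path is valid once the event is visible.
    store.put(path, result)
    event = json.dumps({
        "type": "TaskExecuted",
        "task_id": task_id,
        "result_path": path,
        "size_bytes": len(result),
    })
    topic.append(event)
    return event


def load_result(store, event):
    """Consumer side: follow the path in the event to fetch the payload."""
    return store.get(json.loads(event)["result_path"])
```

The event stays a few hundred bytes no matter how large the result is, so the retention of the topic and the retention of the stored objects can be chosen independently.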

score 0 · Answer 2 · answered Mar 28 '22 at 15:39

This is quite a common issue in replication: a node goes offline for a significant amount of time. For example, a node's hardware fails completely or is lost, and it takes weeks to order and receive a new one.

In that case, in distributed systems, we don't do fail recovery, but we provision a new node as a replacement. That new node is completely empty, hence it needs some initial state.

If your queue has all events since the beginning of time, you could apply those events one by one to the node - that would do the job - but in a very inefficient way (imagine processing years of data).

There is a better process - first restore data for the new node from the most recent backup, and then reapply newer items.
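As a rough sketch of "restore from backup, then reapply newer items": the backup records the offset of the next event that was not yet applied, and the new node replays only from there. The log is modelled as a plain list whose index is the offset, and `take_backup`, `apply_event` and `provision_node` are hypothetical names, as is the event shape.

```python
def take_backup(state, next_offset):
    """Snapshot the state together with the offset of the next unapplied event."""
    return {
        "state": {key: dict(value) for key, value in state.items()},
        "next_offset": next_offset,
    }


def apply_event(state, event):
    """Apply one change event to the local view (assumed event shape)."""
    state.setdefault(event["user_id"], {}).update(event["changes"])


def provision_node(backup, log):
    """Build a fresh node: restore the backup, then replay only newer events."""
    state = {key: dict(value) for key, value in backup["state"].items()}
    for event in log[backup["next_offset"]:]:
        apply_event(state, event)
    return state
```

Storing the offset inside the backup is what makes this safe: the replay starts exactly where the snapshot stopped, so no event is skipped or applied twice. The topic then only needs to retain events for longer than the age of the oldest backup you might restore from.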

Backing up data is important. Every microservice should take care of saving and restoring its own data. As a result, the Kafka topic won't have to keep data forever.

As a quick summary: in distributed replication these are two different problems, catching up a lagging node and provisioning a new node. If a node has been lagging for a long time, it becomes a provisioning problem.
