
I've been reading some articles and questions on eventual consistency and choreographing microservices, but I haven't seen a clear answer to this question. I'll phrase it in generic terms.

In a nutshell: if a client historically makes subsequent synchronous REST calls to your system, what do you do when the later calls may return unexpected results once the calls are made to different microservices (due to eventual consistency)?

Problem

Suppose you have a monolithic application that provides a REST API. Let's say there are two modules A and B you want to convert to microservices. The entities that B maintains can refer to entities that A maintains (e.g. A maintains students and B maintains classes). In the monolithic situation, the modules simply refer to the same database, but in the microservices situation, they each have their own database and communicate via asynchronous messages. So their databases are eventually consistent with respect to each other.

Some existing third-party client applications of our API are used to first (synchronously) calling an endpoint belonging to module A and, after that first call returns, immediately (i.e. a few ms later) calling an endpoint in module B as part of their workflow (e.g. creating a student and putting it in a class). In the new situation, this leads to a problem: when the second call happens, module B may not be aware of the changes in module A yet. So the existing workflow of the client application may break. (E.g. module B may respond: the student you're trying to put in the class doesn't exist, or it is in the wrong year.)

When the calls are done separately by a human user through some frontend application, this is not a big issue, as the modules are usually consistent after a second anyway. The problem arises when a client application (which is not under our control) just calls A and then immediately B as part of an automated workflow. The eventual consistency is simply not fast enough in this instance.

[Figure: a simple diagram that describes the situation]

Question

Is there a best practice, or a generally agreed upon set of options, to mitigate this problem? (I made up the student/class example, don't get hung up on the specifics of that. :))

What we can think of

  • Simply telling the developers of these clients: from now on, you have to implement a retry mechanism for every endpoint you call. The drawback seems obvious.
  • Implement an API gateway that waits until B is ready. Drawback: there are many conceivable workflows (involving more modules A-Z) that would require this, so the gateway might become quite complex.
  • Somehow create a "session" for the client that tracks which requests it has made in succession. Then B can figure out whether it should wait for a message from A, or it could even update its state just by looking at the precise request the client made to A.
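To make the third option a bit more concrete, here is a minimal sketch in which B blocks a request until it has applied A's events up to a given version. It assumes that A attaches a monotonically increasing version number to every change and that something (the gateway, or a per-client session keyed by API key) remembers the last version each client saw. The names `ServiceB`, `apply_event`, `enroll` and `min_version` are hypothetical and only illustrate the idea.

```python
import threading


class ServiceB:
    """Hypothetical service B that tracks the last event version it applied from A."""

    def __init__(self):
        self._applied_version = 0
        self._students = set()
        self._cond = threading.Condition()

    def apply_event(self, version, student_id):
        """Called by the asynchronous message consumer for each event from A."""
        with self._cond:
            self._students.add(student_id)
            self._applied_version = max(self._applied_version, version)
            self._cond.notify_all()  # wake up requests waiting for this version

    def enroll(self, student_id, min_version, timeout=2.0):
        """Handle a client call; wait until B has caught up to min_version."""
        with self._cond:
            caught_up = self._cond.wait_for(
                lambda: self._applied_version >= min_version, timeout=timeout
            )
            if not caught_up:
                return 503, "still catching up, retry later"
            if student_id not in self._students:
                return 404, "unknown student"
            return 200, "enrolled"
```

With this shape the existing clients would not need to change, provided the version can be tracked on the server side. The cost is that B's request latency now depends on the message lag from A, and a timeout path is still needed.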

Are there better methods? Which would be most suitable?

Edit: Clarified that the question primarily concerns the behaviour of third-party clients that call the endpoints in an automated way, meaning that even a few milliseconds 'lag' in the eventual consistency can be fatal.

  • I believe this to be a genuine issue when trying to decouple services, and possibly one of the first to appear. I'm a bit lost, seeing that this problem does not have a preferred set of solutions. – Askar Kalykov Nov 17 '22 at 16:53

2 Answers2

2

The strong-consistency-centric solution to this problem is based on distributed transactions, which unfortunately come with high complexity and performance implications.

In this amazing article on monolith-to-microservices migration, Zhamak Dehghani addresses data inconsistency too:

Distributed transactions are notoriously difficult to implement and as a consequence microservice architectures emphasize transactionless coordination between services, with explicit recognition that consistency may only be eventual consistency and problems are dealt with by compensating operations.

So eventual consistency is effectively the default data-consistency model in a microservices-based architecture. If you need strong-consistency guarantees, you have to build workarounds such as compensating operations or retry flows, which add complexity.
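As a rough illustration of a retry flow, the sketch below retries a call with exponential backoff while the downstream service reports that it has not seen the upstream change yet. The exception `NotYetConsistent` and the function `call_with_retry` are made-up names, and the delays are arbitrary; this is one possible shape, not a prescribed implementation.

```python
import time


class NotYetConsistent(Exception):
    """Raised when the downstream service has not seen the upstream change yet."""


def call_with_retry(operation, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call operation(); on NotYetConsistent, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except NotYetConsistent:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

The same loop could live in the client, in a gateway, or inside service B itself; where it lives decides who pays for the added complexity.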

Moreover, the article highlights a really insightful way of seeing the data inconsistency with respect to the business workflows:

Choosing to manage inconsistencies in this way is a new challenge for many development teams, but it is one that often matches business practice. Often businesses handle a degree of inconsistency in order to respond quickly to demand, while having some kind of reversal process to deal with mistakes. The trade-off is worth it as long as the cost of fixing mistakes is less than the cost of lost business under greater consistency.

Here is the way I see this problem:

  • It's true that the storages of microservices A and B get updated asynchronously, but what is the actual latency of this update workflow? If we're talking about 1 - 2 seconds, then the inconsistency may not be perceived by the users at all. Otherwise, the system should be scaled out to support this (or an even lower) latency threshold.
  • You can monitor the inconsistency events (a user tries to fetch data that doesn't exist in a storage yet because it's still being propagated) and scale your system based on that.
  • The bottom line is that it may help to measure how much you actually need such a consistency guarantee, and then apply a suitable workaround.
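One way to monitor such inconsistency events, sketched under the assumption that the service can see both failed lookups and later event arrivals: remember recent misses, and count a "stale read" whenever the missing entity shows up shortly afterwards. The class `StaleReadMonitor` and the 5-second window are illustrative choices, not something taken from the article.

```python
import time


class StaleReadMonitor:
    """Counts lookups that failed only because the update had not arrived yet."""

    def __init__(self, window_seconds=5.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._recent_misses = {}  # entity id -> time of the last failed lookup
        self.stale_reads = 0

    def record_miss(self, entity_id):
        """Call when a request refers to an entity this service does not know."""
        self._recent_misses[entity_id] = self._clock()

    def record_arrival(self, entity_id):
        """Call when the asynchronous update for an entity is applied."""
        missed_at = self._recent_misses.pop(entity_id, None)
        if missed_at is not None and self._clock() - missed_at <= self._window:
            self.stale_reads += 1  # the earlier miss was caused by replication lag
```

The resulting counter could feed whatever metrics system is already in place. A production version would also need to expire old entries from `_recent_misses`.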
Cosmin Ioniță
  • Thanks for your explanation and the link to the article! Reading your bullet points I realize I failed to stress one important aspect of my question: I was not thinking about the _human_ clients of our application. You're quite right that they probably won't perceive a lag of 1 - 2 seconds. My question concerns _software_ clients, which fire the requests to modules A and B in quick succession (as quickly as their programming and infrastructure allow). I will edit the question accordingly. – Merlin's Beard Mar 30 '21 at 19:46
  • Thanks for the clarifications. In this case, I believe it's way easier to deal with this problem by building retry flows (compensating operations), or as a last resort, by having the same storage between the services A and B – Cosmin Ioniță Mar 30 '21 at 20:29
  • Just wanted to note that while the article is hosted at martinfowler.com, it was actually written by Zhamak Dehghani. – Ian Reasor Aug 24 '21 at 20:01

Is there a best practice, or a generally agreed upon set of options, to mitigate this problem?

Yes. You can't break up every method into its own microservice with its own repository.

You scope your microservices and repositories to accommodate genuine requirements for strong consistency. If you have a use case where a call to service endpoint A is followed immediately by a call to service endpoint B that needs to see the results of the first call, then A and B should be part of the same microservice or share the same repository.

David Browne - Microsoft
  • Perhaps I can phrase the gist of my question in these terms then: what if _I_ don't have a use case for a call to B immediately after A, but I discover (after separating A and B) that some _other_ consumer of my API apparently _does_ have such a use case? More generally, if I divide up my monolithic system in modules according to my own use cases, how do I deal with other consumers who 'randomly' (as it seems to me) call different parts of my system and expect that to keep working? Is the answer "you can't", or is there a 'best practice workaround' for this? – Merlin's Beard Apr 03 '21 at 12:33
  • You normally don't have strangers appear and mutate your data, but all you can do in that scenario is tell the consumer to wait for your system to reach consistency with a delay or retry. And you might come to feel that you divided your system too aggressively. And note that it's acceptable for multiple microservices to share a repository, and for a repository to store more than one thing. The grouping is always a balancing act. – David Browne - Microsoft Apr 03 '21 at 15:51
  • I'll clarify the 'strangers' bit some more, in case that changes the answer. :) Our application basically consists of a backend with a public API and several frontend applications. Many customers interact with our system solely through one of our frontends, but anyone is free to call our API directly to integrate our system into their own. That's also a large portion of the client base, and this is what causes the problem. These integrations were not built with eventual consistency in mind. We would like to split our microservices while minimizing the need for rewrites at the customer's end. – Merlin's Beard Apr 12 '21 at 18:16
  • I think that will constrain the degree to which you can split your repositories. If clients have their own data, you can "shard" the repository, so each client works against a single, consistent view of their data, while other clients hit different repositories. – David Browne - Microsoft Apr 12 '21 at 18:24
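
A minimal sketch of the sharding idea from the last comment, assuming each client has a stable identifier: every request from the same client is routed to the same repository, so that client's calls to A and B see one consistent view. The shard names and the function `shard_for` are hypothetical.

```python
import hashlib

SHARDS = ["repo-0", "repo-1", "repo-2"]  # hypothetical repository names


def shard_for(client_id, shards=SHARDS):
    """Map a client id to one shard deterministically, using a stable hash."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would send the same client to different shards after a restart. Note that changing the number of shards remaps most clients, so a real deployment would more likely use a lookup table or consistent hashing.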