I'm working on a solution with two databases, used for the following purposes:
- An Elasticsearch cluster used for search
- A Postgres database that acts as a source of truth for the data
Our application allows users to retrieve and update products, and a product has multiple attributes: name, price, description, etc. Two typical use cases are:
- Retrieve products by name: a search is performed against Elasticsearch, and the IDs it returns are then used in a second query against Postgres to fetch the authoritative data (so we get fast searches on big tables while still serving trustworthy data)
- Update product fields: we allow users to update any product information (kind of a collaborative wiki). We first write the data to Postgres, then index it into Elasticsearch.
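In toy form, the two paths above look like the sketch below. Since the real setup needs live services, stdlib `sqlite3` stands in for Postgres and a plain dict stands in for the Elasticsearch index; all table, field, and function names are illustrative, not our actual schema:

```python
# Toy model of the current architecture. sqlite3 stands in for Postgres,
# a dict stands in for the Elasticsearch index; names are illustrative.
import sqlite3

pg = sqlite3.connect(":memory:")
pg.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
pg.execute("INSERT INTO products VALUES (1, 'Pear', 0.5)")
pg.commit()

es_index = {1: {"name": "Pear"}}  # stand-in for the ES document store

def update_product(product_id, name):
    # Write path: Postgres first, then Elasticsearch -- the unordered
    # dual write where the race condition appears.
    pg.execute("UPDATE products SET name = ? WHERE id = ?", (name, product_id))
    pg.commit()
    es_index[product_id] = {"name": name}  # second write, no ordering guarantee

def search_by_name(term):
    # Read path: resolve IDs in "ES", then fetch authoritative rows
    # from "Postgres".
    ids = [pid for pid, doc in es_index.items()
           if term.lower() in doc["name"].lower()]
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    return pg.execute(
        f"SELECT id, name, price FROM products WHERE id IN ({placeholders})",
        ids,
    ).fetchall()

update_product(1, "Banana")
print(search_by_name("ban"))  # [(1, 'Banana', 0.5)]
```

With a single writer this works fine; the trouble starts when two updates to the same product interleave between the two writes.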
However, as I feared, race conditions appeared as the number of users grew: if user #1 changed a product's name to "Banana" while user #2 simultaneously changed the same product's name to "Apple", Elasticsearch would sometimes end up with "Banana" as the last value saved while Postgres held "Apple", creating a serious inconsistency between the two databases.
So I've ventured into reading about optimistic/pessimistic locking to solve my problem, but so far all the articles I find deal with a single relational database, and the solutions offered rely on ORM implementations (e.g. Hibernate). Our combined storage solution of ES + Postgres requires more "ballet" than that.
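For concreteness, here is my understanding of the optimistic-locking pattern those articles describe, translated off the ORM and with the row version also propagated to the search index. This is only a sketch of the idea, not something we run: `sqlite3` and a dict again stand in for Postgres and ES, and the `version` column, `update_name` helper, and last-write check are all hypothetical (real Elasticsearch can, as I understand it, enforce a similar ordering via external versioning on the index call):

```python
# Sketch of optimistic locking without an ORM, assuming a hypothetical
# `version` column on the products table. sqlite3 stands in for Postgres,
# a dict for Elasticsearch; all names are illustrative.
import sqlite3

pg = sqlite3.connect(":memory:")
pg.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")
pg.execute("INSERT INTO products VALUES (1, 'Pear', 1)")
pg.commit()

es_index = {}  # stand-in for ES; the version check below mimics what
               # ES external versioning would do on the index call

def update_name(product_id, new_name, expected_version):
    # Compare-and-swap: the UPDATE only succeeds if nobody else bumped
    # the version since this caller read the row.
    cur = pg.execute(
        "UPDATE products SET name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_name, product_id, expected_version),
    )
    pg.commit()
    if cur.rowcount == 0:
        return False  # stale write: caller must re-read and retry
    new_version = expected_version + 1
    # Propagate the Postgres version to "ES", dropping the write if a
    # newer version is already indexed.
    current = es_index.get(product_id)
    if current is None or current["version"] < new_version:
        es_index[product_id] = {"name": new_name, "version": new_version}
    return True

# Two users both read version 1, then both try to write:
assert update_name(1, "Banana", 1) is True   # first writer wins
assert update_name(1, "Apple", 1) is False   # second is rejected, must retry
```

This handles lost updates inside Postgres, but I'm unsure it is the right way to keep the two stores consistent, which is what leads to my question.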
What techniques/options are available to me to solve this kind of problem?