
I was wondering how microservices in a streaming pipeline based on Event-Driven Architecture can be truly decoupled from the data model perspective. We have implemented a data processing pipeline using Event-Driven Architecture where the data model is very critical. Although all the microservices are decoupled from the business perspective, they are not truly decoupled, as the data model is shared across all the services.

In the ingestion pipeline, we collect data from multiple sources, each with a different data model. Hence, a normalizer microservice is required to normalize those data models to a common data model that can be used by downstream consumers. The challenge is that the data model can change for any reason, and we should be able to manage that change easily. However, that level of change can break the consumer applications and can easily introduce a cascade of modifications to all the microservices.

Is there any solution or technology that can truly decouple microservices in this scenario?

Ali
  • I couldn't find any, as someone who has been using microservices for more than 5 years now. But when I see "not truly decoupled as the data model is shared", it sounds to me like you may have a "distributed monolith" instead of microservices. If you need the same data model for all your microservices, most likely you are trying to deal with the same context in different microservices, which only makes sense if you have a very high scalability load for some operation. Otherwise you could have just one microservice that is responsible for the whole operation for a specific context. – cool Mar 02 '19 at 20:38
  • @cool It is not that we need to use the same data model between different microservices. Every service can use its own data model, but it should be mapped to the internal data model of the corresponding microservice. Imagine you have a data pipeline: every microservice does something to your data and passes it to the next one. Therefore, every one of them should understand what the data model is. – Ali Mar 03 '19 at 03:56

1 Answer


This problem is solved by carefully designing the data model to ensure backward and forward compatibility. Such a design is important for independent evolution of services, rolling upgrades, etc. A data model is said to be backward compatible if a new client (using the new model) can read/write data written by another client (using the old model). Similarly, forward compatibility means a client (using the old data model) can read/write data written by another client (using the new data model).

Let's say a Person object is shared across services in a JSON-encoded format. Now one of the services introduces a new field, alternateContact. A service consuming this data and using the old data model can simply ignore this new field and continue its operation. If you're using the Jackson library, you'd use @JsonIgnoreProperties(ignoreUnknown = true). Thus the consuming service is designed for forward compatibility.
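
As a minimal sketch (the field names are illustrative, not taken from the question), a consumer on the old model might look like this with Jackson:

    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Old data model: it has no idea that alternateContact exists.
    @JsonIgnoreProperties(ignoreUnknown = true)
    class Person {
        public String name;
        public String contact;
    }

    public class ForwardCompatibleConsumer {
        public static void main(String[] args) throws Exception {
            // JSON produced by a service that already uses the new model.
            String incoming = "{\"name\":\"Jane\",\"contact\":\"123\",\"alternateContact\":\"456\"}";

            // Deserialization succeeds; the unknown field is silently ignored.
            Person person = new ObjectMapper().readValue(incoming, Person.class);
            System.out.println(person.name + " / " + person.contact);
        }
    }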

A problem arises when a service (using the old data model) deserializes Person data written with the new model, updates one or more field values, and writes the data back. Since the unknown properties are ignored, the write results in data loss.
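
Continuing the same illustrative sketch, the read-modify-write round trip silently drops the new field:

    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Same old model as above: unknown properties are ignored on read.
    @JsonIgnoreProperties(ignoreUnknown = true)
    class PersonOldModel {
        public String name;
        public String contact;
    }

    public class RoundTripLoss {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            String incoming = "{\"name\":\"Jane\",\"contact\":\"123\",\"alternateContact\":\"456\"}";

            PersonOldModel person = mapper.readValue(incoming, PersonOldModel.class);
            person.contact = "789";                        // update a known field
            String outgoing = mapper.writeValueAsString(person);

            // Prints {"name":"Jane","contact":"789"} -- alternateContact is lost.
            System.out.println(outgoing);
        }
    }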

Fortunately, binary encoding formats such as Protocol Buffers (version 3.5 and later) preserve unknown fields during deserialization with an old model. Thus, when you serialize the data back, the new fields remain as is.
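
A rough sketch of the same round trip with protobuf-java 3.5+, assuming a hypothetical Person message generated from a .proto file with fields name = 1 and contact = 2, while a newer producer also sets alternate_contact = 3:

    import com.google.protobuf.InvalidProtocolBufferException;

    public class ProtoRoundTrip {
        // bytesFromNewProducer were serialized with the NEW model (it also sets field 3).
        public static byte[] updateContact(byte[] bytesFromNewProducer)
                throws InvalidProtocolBufferException {
            // Parsing with the OLD generated class keeps field 3 in the message's unknown-field set.
            Person person = Person.parseFrom(bytesFromNewProducer);

            // Update a known field and serialize back; the unknown field 3 is written out unchanged.
            return person.toBuilder()
                         .setContact("789")
                         .build()
                         .toByteArray();
        }
    }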

There may be other data model evolutions you need to deal with, like field removal, field renaming, etc. The basic idea is that you need to be aware of and plan for these possibilities early on in the design phase. The common data encoding formats are JSON, Apache Thrift, Protocol Buffers, Avro, etc.

Saptarshi Basu
  • I think there are two issues here: 1) How to truly keep the microservices decoupled from the data model. 2) How to normalize different data models to a common data model that can be used by downstream consumers. I guess using something like Avro and Schema Registry helps simplify the first item, but what would be the best approach to deal with the second item? There will be lots of scenarios where the data model changes go beyond backward/forward compatibility when we want to map data from different models to a single common model. – Ali Mar 03 '19 at 04:00