## Purely Avro Question First
Is it possible to have an Avro schema compatible with the following message, where the `before` and `after` fields must be of the same record type?

```json
{
  "before": null,
  "after": {
    "id": 1,
    "name": "Bob"
  },
  "op": "c"
}
```
When I define a schema like the following:

```json
{
  "type": "record",
  "name": "Envelope",
  "fields": [
    {
      "name": "before",
      "default": null,
      "type": [
        "null",
        {
          "type": "record",
          "name": "Value",
          "fields": [
            {
              "name": "id",
              "type": "int"
            },
            {
              "name": "name",
              "type": "string"
            }
          ]
        }
      ]
    },
    {
      "name": "after",
      "default": null,
      "type": [
        "null",
        "Value"
      ]
    },
    {
      "name": "op",
      "type": "string"
    }
  ]
}
```
However, with that ^^ schema, it expects an extra nested layer: it only accepts messages in the following form:

```json
{
  "after": {
    "Value": {
      "id": 10,
      "name": "Bob"
    }
  },
  "before": null,
  "op": "c"
}
```
I've read through Avro's documentation and examples such as this one, and apparently this is the expected/normal behavior :O Why do I need the extra `Value` object there, as opposed to having those fields directly at the root level of the `after` or `before` field?
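To illustrate the rule behind this: in Avro's JSON encoding, a union value other than null is wrapped in a single-key object whose key names the selected branch, which is exactly where the extra `Value` layer comes from. Below is a minimal Python sketch of that rule (hypothetical helper names, not an actual Avro library API):

```python
# Sketch of Avro's JSON-encoding rule for a union like ["null", "Value"].
# The "null" branch is encoded bare; any other branch is wrapped in a
# one-key object naming the branch -- hence {"Value": {...}} on the wire.

def json_encode_union(value, branch_name):
    """Encode a union value the way Avro's JSON encoding does."""
    if value is None:
        return None
    return {branch_name: value}

def json_decode_union(encoded):
    """Invert the wrapping: pull the datum back out of the branch object."""
    if encoded is None:
        return None
    (branch, datum), = encoded.items()  # exactly one key: the branch name
    return datum

wrapped = json_encode_union({"id": 1, "name": "Bob"}, "Value")
assert wrapped == {"Value": {"id": 1, "name": "Bob"}}
assert json_decode_union(wrapped) == {"id": 1, "name": "Bob"}
assert json_encode_union(None, "Value") is None
```

The wrapper exists because the JSON encoding must record *which* branch of the union was chosen; the compact binary encoding records this as a branch index instead, so no extra object appears there.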
## Some Additional Context

Essentially I'm using the Debezium-Postgres Kafka connector along with Confluent's Schema Registry in order to:

1. Capture change-log events associated with my upstream Postgres DB
2. Avro-serialize and publish those events to a Kafka topic
3. Ingest said Kafka topic via Hudi's DeltaStreamer to store/index them in a blob storage like S3 as a Data Lake
WRT #3, DeltaStreamer expects the fields to be at the root level of the `after` or `before` fields, as you can see in this line and this line. Instead, it's receiving `after.Value.<...>` or `before.Value.<...>`.
## Key Questions

- In pure Avro, is it possible to have a schema that allows `before.id` and the like, instead of `before.Value.id`?
- If so, how can I configure Confluent's AvroConverter to infer the schema and serialize change-log events accordingly?
Thanks in advance, really appreciate any help/thoughts on this.
## Update-1

This update is based on Nic Pegg's response to the original question laid out above:

> The behaviour you're seeing is due to the way Avro handles union types. When you specify a union type for a field in Avro, like how Debezium handles "before" and "after" fields, Avro uses a JSON encoding style to wrap the actual value in a JSON object whose key is the schema name; this determines which branch of the union is used for each datum.
Given this is Avro's normal behavior, I'm surprised others who have dealt with the `PostgreSQL => Debezium => Kafka => Apache Hudi` pipeline have not run into this same issue, at least based on my searching around; for instance, the conversations I've seen in threads like this one or this one. It makes me think maybe I'm mistaken in my hypothesis that Hudi can't handle these `Value`-nested messages :/
> Just so you know, this transformation will fail if the after field is null or does not have a Value field, so you may want to add additional checks or transformations to handle these cases.
Given these are DB change-log events, there will be events with `after = NULL` (i.e. when a DELETE statement is applied) and also events with `before = NULL` (i.e. when an INSERT statement is applied). So is there any single SMT configuration/combination that would turn these messages:
```json
// Message 1
{
  "after": {
    "Value": {
      "id": 10,
      "name": "Bob"
    }
  },
  "before": null,
  "op": "c"
}

// Message 2
{
  "after": null,
  "before": {
    "Value": {
      "id": 10,
      "name": "Bob"
    }
  },
  "op": "d"
}
```
Into these ones:
```json
// Message 1
{
  "after": {
    "id": 10,
    "name": "Bob"
  },
  "before": null,
  "op": "c"
}

// Message 2
{
  "after": null,
  "before": {
    "id": 10,
    "name": "Bob"
  },
  "op": "d"
}
```
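Whatever SMT ends up doing the work, the intended un-nesting itself is mechanical. Here's a plain-Python sketch of it (a hypothetical helper for reference, not actual Kafka Connect SMT code) that passes both null cases through untouched:

```python
def unwrap_value(envelope, branch="Value"):
    """Return a copy of a change-log envelope with before/after un-nested."""
    out = dict(envelope)
    for field in ("before", "after"):
        wrapped = out.get(field)
        # Only unwrap the one-key {"Value": {...}} form; null (from
        # INSERTs/DELETEs) and anything else passes through unchanged.
        if isinstance(wrapped, dict) and set(wrapped) == {branch}:
            out[field] = wrapped[branch]
    return out

msg1 = {"after": {"Value": {"id": 10, "name": "Bob"}}, "before": None, "op": "c"}
msg2 = {"after": None, "before": {"Value": {"id": 10, "name": "Bob"}}, "op": "d"}

assert unwrap_value(msg1) == {"after": {"id": 10, "name": "Bob"}, "before": None, "op": "c"}
assert unwrap_value(msg2) == {"after": None, "before": {"id": 10, "name": "Bob"}, "op": "d"}
```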
> Note that this solution might not work for all Debezium payloads, especially those with complex nested schemas. You might need to create a custom SMT to handle more complex transformations.
If a simple SMT configuration can't handle this un-nesting/unwrapping, then it might make more sense to implement such transformations at the Hudi stage of this data flow. But it would be ideal not to need extra transformer(s) just to unwrap the `Value` object.
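For what it's worth, Debezium ships its own event-flattening SMT, `io.debezium.transforms.ExtractNewRecordState`, which replaces the whole envelope with the row state rather than just stripping the `Value` wrapper, so it may or may not fit a DeltaStreamer pipeline that expects the envelope to be preserved. A sketch of the relevant connector properties (property names from Debezium's documentation; values are illustrative):

```json
{
  "transforms": "unwrap",
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.unwrap.drop.tombstones": "false",
  "transforms.unwrap.delete.handling.mode": "rewrite"
}
```

With `delete.handling.mode=rewrite`, delete events are emitted as the last known row state plus a `__deleted` marker field instead of a null `after`, which sidesteps the null-handling concern raised above.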