
Purely Avro Question First

Is it possible to have an Avro schema compatible with the following message, where the before and after fields must be of the same record type:

{
  "before": null,
  "after": {
    "id": 1,
    "name": "Bob"
  },
  "op": "c"
}

When I define a schema like the following:

{
  "type": "record",
  "name": "Envelope",
  "fields": [
    {
      "name": "before",
      "default": null,
      "type": [
        "null",
        {
          "type": "record",
          "name": "Value",
          "fields": [
            {
              "name": "id",
              "type": "int"
            },
            {
              "name": "name",
              "type": "string"
            }
          ]
        }
      ]
    },
    {
      "name": "after",
      "default": null,
      "type": [
        "null",
        "Value"
      ]
    },
    {
      "name": "op",
      "type": "string"
    }
  ]
}

However, with that ^^ schema, Avro expects an extra nested layer; it only accepts messages in the following form:

{
  "after": {
    "Value": {
      "id": 10,
      "name": "Bob"
    }
  },
  "before": null,
  "op": "c"
}

I've read through Avro's documentation and examples such as this one, and apparently this is the expected/normal behavior :O Why do I need the extra Value object there, as opposed to having those fields directly at the root level of the after or before field?

Some Additional Context

Essentially, I'm using the Debezium Postgres Kafka connector along with Confluent's Schema Registry in order to:

  1. Capture change-log events associated with my upstream Postgres DB
  2. Then Avro-Serialize-and-Publish those events to a Kafka topic
  3. Finally, I'm ingesting said Kafka topic via Hudi's DeltaStreamer to store/index them into a Blob Storage like S3 as a Data Lake.

WRT #3, DeltaStreamer expects the fields to be at the root level of the after or before field, as you can see in this line and this line. Instead, it's receiving after.Value.<...> or before.Value.<...>.

Key Questions

  1. In pure Avro, is it possible to have a schema that allows before.id and the like, instead of before.Value.id?
  2. If so, how can I configure Confluent's AvroConverter to infer such a schema and serialize change-log events accordingly?

Thanks in advance, really appreciate any help/thoughts on this.


Update-1

This update is based on Nic Pegg's response to the original question laid out above.

> The behaviour you're seeing is due to the way Avro handles union types. When you specify a union type for a field in Avro, like how Debezium handles "before" and "after" fields, Avro uses a JSON encoding style to wrap the actual value in a JSON object whose key is the schema name; this determines which type of union is used for each datum.

Given this is Avro's normal behavior, I'm surprised that others who have dealt with the PostgreSQL => Debezium => Kafka => Apache Hudi pipeline haven't run into this same issue, at least based on my searching around and the conversations I've seen in threads like this one or this one. It makes me think I may be mistaken in my hypothesis that Hudi can't handle these Value-nested messages :/

> Just so you know, this transformation will fail if the after field is null or does not have a Value field, so you may want to add additional checks or transformations to handle these cases.

Given these are DB change-log events, there will be events with after = NULL (i.e. when a DELETE statement is applied) and also events with before = NULL (i.e. when an INSERT statement is applied). So is there any single SMT configuration/combination that would turn these messages:

// Message 1
{
  "after": {
    "Value": {
      "id": 10,
      "name": "Bob"
    }
  },
  "before": null,
  "op": "c"
}

// Message 2
{
  "after": null,
  "before": {
    "Value": {
      "id": 10,
      "name": "Bob"
    }
  },
  "op": "d"
}

Into these ones:

// Message 1
{
  "after": {
    "id": 10,
    "name": "Bob"
  },
  "before": null,
  "op": "c"
}

// Message 2
{
  "after": null,
  "before": {
    "id": 10,
    "name": "Bob"
  },
  "op": "d"
}

> Note that this solution might not work for all Debezium payloads, especially those with complex nested schemas. You might need to create a custom SMT to handle more complex transformations.

If a simple SMT configuration can't handle this un-nesting/unwrapping, then it might make more sense to implement such a transformation at the Hudi stage of this data flow. Ideally, though, there wouldn't be a need for extra transformer(s) just to unwrap the Value object.

samser

1 Answer


The behaviour you're seeing is due to the way Avro handles union types. When you specify a union type for a field in Avro, as Debezium does for the "before" and "after" fields, Avro's JSON encoding wraps the actual value in a JSON object whose key is the schema name; this indicates which member of the union a given datum uses.

This is because some languages (like JavaScript) don't have strong typing, so it's not always possible to unambiguously determine the type of a value at runtime. However, by wrapping the value in a JSON object with a key corresponding to the schema name, Avro can always determine which schema should be used to interpret a particular piece of data.

So, in your example, the "after" field is defined as a union of "null" and "Value". When you provide data for the "after" field, Avro expects it to be in the form:

"after": {
    "Value": {
      "id": 10,
      "name": "Bob"
    }
}

This tells Avro that the "after" field data should be interpreted according to the "Value" schema. If, on the other hand, the "after" field were null, you would simply represent it as:

"after": null

This is why you see that extra layer of nesting in your data.
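
For what it's worth, here's a minimal sketch (plain Apache Avro Java API, outside of Kafka Connect) that reproduces this wrapping. It assumes the Envelope schema from your question has been saved to a hypothetical file named envelope.avsc:

import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class UnionJsonEncodingDemo {
    public static void main(String[] args) throws Exception {
        // envelope.avsc is assumed to contain the Envelope schema from the question
        Schema envelope = new Schema.Parser().parse(Files.readString(Path.of("envelope.avsc")));
        // The second branch of the ["null", "Value"] union is the Value record schema
        Schema value = envelope.getField("after").schema().getTypes().get(1);

        GenericRecord after = new GenericData.Record(value);
        after.put("id", 1);
        after.put("name", "Bob");

        GenericRecord record = new GenericData.Record(envelope);
        record.put("before", null);
        record.put("after", after);
        record.put("op", "c");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(envelope, out);
        new GenericDatumWriter<GenericRecord>(envelope).write(record, encoder);
        encoder.flush();

        // Prints the non-null union branch wrapped under its type name:
        // {"before":null,"after":{"Value":{"id":1,"name":"Bob"}},"op":"c"}
        System.out.println(out);
    }
}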

To get the desired behaviour, you must change your schema so that "before" and "after" are not union types. However, this would mean that you couldn't represent null values for those fields, which may not be what you want and would likely be incompatible with Debezium without using Single Message Transforms to alter the payload.

Single Message Transforms (SMTs) in Kafka Connect can transform the data before it is written to Kafka. Debezium, a source connector in Kafka Connect, is used to capture and stream database changes into Kafka. SMTs can be applied to the data Debezium sends to Kafka.

For instance, you could use the ExtractField$Value SMT to unwrap the Value field from the after field. Here's an example configuration:

transforms=unwrapAfter
transforms.unwrapAfter.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.unwrapAfter.field=after.Value

This configuration creates a transformation called unwrapAfter that uses the ExtractField$Value SMT. The field parameter tells the SMT which field to extract.

This SMT will operate on the entire record value and replace it with the Value field of the after field. Just so you know, this transformation will fail if the after field is null or does not have a Value field, so you may want to add additional checks or transformations to handle these cases.

Also, please keep in mind that SMTs operate on each message individually and have no knowledge of other messages. This means they cannot, for example, transform a field based on the value of that field in a previous or future message.

Note that this solution might not work for all Debezium payloads, especially those with complex nested schemas. You might need to create a custom SMT to handle more complex transformations.
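
If you do end up needing a custom SMT, here's a rough sketch of what one could look like. The package and class names are made up, and it assumes the Connect value arrives as a Struct whose before/after fields each wrap a single nested "Value" struct (which may not match your real Debezium payload); null before/after values are passed through untouched:

package com.example.smt; // hypothetical package/class names, for illustration only

import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

public class UnwrapUnionValue<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        if (!(record.value() instanceof Struct)) {
            return record; // tombstones / schemaless values pass through unchanged
        }
        Struct envelope = (Struct) record.value();

        // Rebuild the envelope schema, replacing each wrapper struct with its inner Value schema
        SchemaBuilder schemaBuilder = SchemaBuilder.struct().name(envelope.schema().name());
        for (Field field : envelope.schema().fields()) {
            schemaBuilder.field(field.name(), unwrapSchema(field.schema()));
        }
        Schema newSchema = schemaBuilder.build();

        // Rebuild the value, lifting the fields of the nested Value struct up one level
        Struct newValue = new Struct(newSchema);
        for (Field field : envelope.schema().fields()) {
            Object raw = envelope.get(field);
            if (raw instanceof Struct && isWrapper(field.schema())) {
                Struct inner = ((Struct) raw).getStruct("Value");
                Struct lifted = new Struct(newSchema.field(field.name()).schema());
                for (Field f : inner.schema().fields()) {
                    lifted.put(f.name(), inner.get(f));
                }
                newValue.put(field.name(), lifted);
            } else {
                newValue.put(field.name(), raw); // nulls and plain fields (e.g. "op") copied as-is
            }
        }

        return record.newRecord(record.topic(), record.kafkaPartition(), record.keySchema(),
                record.key(), newSchema, newValue, record.timestamp());
    }

    // A "wrapper" is a struct holding exactly one struct field named "Value"
    private boolean isWrapper(Schema schema) {
        return schema.type() == Schema.Type.STRUCT
                && schema.fields().size() == 1
                && "Value".equals(schema.fields().get(0).name())
                && schema.fields().get(0).schema().type() == Schema.Type.STRUCT;
    }

    // Copy of the inner Value schema, kept optional so a null before/after still validates
    private Schema unwrapSchema(Schema schema) {
        if (!isWrapper(schema)) {
            return schema;
        }
        Schema inner = schema.fields().get(0).schema();
        SchemaBuilder builder = SchemaBuilder.struct().name(inner.name()).optional();
        for (Field f : inner.fields()) {
            builder.field(f.name(), f.schema());
        }
        return builder.build();
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

You would then package this as a jar on the Connect plugin path and reference it with something like transforms.unwrap.type=com.example.smt.UnwrapUnionValue in the connector configuration.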

Nic Pegg
  • Thank you very much for the thorough response and the detailed explanation Nic, I really appreciate it. This makes sense to me in general. I have some follow-up questions though, so I'll lay them out in an "Update-1" section in the original question. Please check it out when you get a chance and let me know what you think. Thanks a lot again :) – samser May 13 '23 at 00:32
  • Are you having issues following that blog post? From my understanding, the class PostgresDebeziumAvroPayload using Spark should seamlessly handle Debezium Avro payloads – Nic Pegg May 15 '23 at 21:08
  • So I did some more digging and debugging and added some details in [this GH issue comment](https://github.com/apache/hudi/issues/8519#issuecomment-1550008181). Please check it out (the follow-up comments there) and let me know what you think. TL;DR -- The nested Value object was a red herring (most likely). The real concern is why the "source" object's schema in the payload published by Debezium, does not exactly match what's laid out in [the Debezium's documentation](https://debezium.io/documentation/reference/stable/connectors/postgresql.html#postgresql-create-events)?! – samser May 16 '23 at 23:45
  • Are you able to post a full example of the Debezium payload as it appears in the Kafka topic? – Nic Pegg May 17 '23 at 17:23
  • Sure ... You can see an example complete message I consumed using Confluent's **kafka-avro-console-consumer** mentioned in [this GH issue description](https://github.com/apache/hudi/issues/8761#issue-1716536714) under the "**Additional Context**" section. Please let me know in case you need more information. At this point, I'm not 100% sure IF the issue is solely on the Hudi side, or **also** due to Debezium publishing messages with inaccurate/unexpected structure (i.e. `source` blob). – samser May 19 '23 at 17:07