
So I have a Kafka topic that contains Avro records with different schemas. I want to consume from that topic in Flink and create a DataStream of Avro GenericRecord (this part is done).

Now I want to write that data to Hudi using the schema extracted from the DataStream. But since the Hudi pipeline/writer takes a config with a predefined Avro schema up front, I can't do that directly.

A probable solution is to create a keyed stream, using a key that identifies one type of schema, extract the schema from that stream, and then create a dynamic Hudi pipeline based on it.

I'm not sure about the last part, i.e. whether that is possible.

A-->B-->C

Where A is the stream of generic Avro records with different schemas, B is the stream partitioned by schema, and C uses the schema carried inside stream B to create a config and pass it to the Hudi pipeline writer function.
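For reference, a minimal sketch of what step B could look like, assuming the DataStream<GenericRecord> from step A is available (the class and method names are illustrative, not from the question):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;

public class SchemaKeying {
    // Step B from the diagram: group the generic-record stream by schema full name,
    // so records sharing a schema end up in the same key group.
    static KeyedStream<GenericRecord, String> keyBySchema(DataStream<GenericRecord> records) {
        return records.keyBy(new KeySelector<GenericRecord, String>() {
            @Override
            public String getKey(GenericRecord record) {
                return record.getSchema().getFullName();
            }
        });
    }
}
```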


1 Answer


I think I just answered this same question (but for Parquet) yesterday, but I can't find it now :) Anyway, if you know the different schemas in advance (assumed so, otherwise how would you deserialize the incoming Kafka records?), then you can use a ProcessFunction with multiple side outputs, where you split the stream by schema, and then connect each schema-specific stream to its own Hudi sink.
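A hedged sketch of that split, assuming the set of schemas is known up front. The schema names (com.example.Orders / com.example.Users), the tag names, and the sink wiring are illustrative placeholders, not a specific Hudi API:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SchemaRouter extends ProcessFunction<GenericRecord, GenericRecord> {

    // One side-output tag per schema expected on the topic (names are hypothetical).
    public static final OutputTag<GenericRecord> ORDERS_TAG =
            new OutputTag<GenericRecord>("orders") {};
    public static final OutputTag<GenericRecord> USERS_TAG =
            new OutputTag<GenericRecord>("users") {};

    @Override
    public void processElement(GenericRecord record, Context ctx, Collector<GenericRecord> out) {
        // Route each record to the side output that matches its schema.
        switch (record.getSchema().getFullName()) {
            case "com.example.Orders":
                ctx.output(ORDERS_TAG, record);
                break;
            case "com.example.Users":
                ctx.output(USERS_TAG, record);
                break;
            default:
                out.collect(record); // unknown schemas stay on the main output
        }
    }

    // Wiring: split the incoming stream, then hand each schema-specific stream to its
    // own sink, each built with the matching (known-in-advance) Avro schema.
    public static void wire(DataStream<GenericRecord> records) {
        SingleOutputStreamOperator<GenericRecord> routed = records.process(new SchemaRouter());
        DataStream<GenericRecord> orders = routed.getSideOutput(ORDERS_TAG);
        DataStream<GenericRecord> users = routed.getSideOutput(USERS_TAG);
        // orders -> Hudi sink configured with the Orders schema
        // users  -> Hudi sink configured with the Users schema
    }
}
```

Because each side output carries records of exactly one schema, each downstream Hudi writer can be configured with a fixed Avro schema, which is what the Hudi writer config expects.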

kkrugler