
I have messages in Avro format in Kafka. These have to be converted to a table and selected using SQL, then converted back to a stream and finally sent to a sink. There are multiple Kafka topics with different Avro schemas, hence dynamic tables are required.

Here is the code which I am using

StreamExecutionEnvironment env = ...;
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

FlinkKafkaConsumer<MyAvroClass> kafkaConsumer = ...;
var kafkaInputStream = env.addSource(kafkaConsumer, "kafkaInput");

Table table = tableEnv.fromDataStream(kafkaInputStream);
tableEnv.executeSql("DESCRIBE " + table).print();
...

MyAvroClass is an Avro class which extends SpecificRecordBase and contains an array. Here is the code for this class:

public class MyAvroClass extends SpecificRecordBase implements SpecificRecord {
  // avro fields
  private String event_id;
  private User user;
  private List<Item> items; 
  
  // getter, setters, constructors, builders, ...
}

I am unable to access the elements of the items field. When I print the table description, I see that items is of type ANY:

+------------+-------------------------------------------------------------+------+-----+--------+-----------+
|       name |                                                        type | null | key | extras | watermark |
+------------+-------------------------------------------------------------+------+-----+--------+-----------+
|   event_id |                                                      STRING | true |     |        |           |
|      items |                        LEGACY('RAW', 'ANY<java.util.List>') | true |     |        |           |
|       user |  LEGACY('STRUCTURED_TYPE', 'POJO<com.company.events.User>') | true |     |        |           |
+------------+-------------------------------------------------------------+------+-----+--------+-----------+  

How can I convert it to a type that allows me to query the contents of items? Thanks in advance.

warrior107

2 Answers


I'm currently using the following method for this purpose. The idea is SpecificRecord -(via AvroSerializationSchema)-> binary -(via AvroRowDeserializationSchema)-> Row. Note that you need to declare the output type of .map using .returns, as advised in a comment on FLINK-23885.

import org.apache.avro.specific.SpecificRecord;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.formats.avro.AvroRowDeserializationSchema;
import org.apache.flink.formats.avro.AvroSerializationSchema;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public static <T extends SpecificRecord> Table toTable(StreamTableEnvironment tEnv,
                                                       DataStream<T> dataStream,
                                                       Class<T> cls) {
  RichMapFunction<T, Row> avroSpecific2RowConverter = new RichMapFunction<>() {
    // initialized once per task in open()
    private transient AvroSerializationSchema<T> avro2bin = null;
    private transient AvroRowDeserializationSchema bin2row = null;

    @Override
    public void open(Configuration parameters) throws Exception {
      avro2bin = AvroSerializationSchema.forSpecific(cls);
      bin2row = new AvroRowDeserializationSchema(cls);
    }

    @Override
    public Row map(T value) throws Exception {
      // SpecificRecord -> Avro binary -> Row
      byte[] bytes = avro2bin.serialize(value);
      Row row = bin2row.deserialize(bytes);
      return row;
    }
  };

  SingleOutputStreamOperator<Row> rows = dataStream.map(avroSpecific2RowConverter)
    // https://issues.apache.org/jira/browse/FLINK-23885
    .returns(AvroSchemaConverter.convertToTypeInfo(cls));

  return tEnv.fromDataStream(rows);
}
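
For reference, a rough usage sketch (the stream and class names are taken from the question; the view name and the SQL are only examples):

Table table = toTable(tableEnv, kafkaInputStream, MyAvroClass.class);
tableEnv.executeSql("DESCRIBE " + table).print(); // items should no longer show up as LEGACY('RAW', ...)

tableEnv.createTemporaryView("events", table);
// element access is 1-based in Flink SQL once items is a proper ARRAY type
Table firstItems = tableEnv.sqlQuery("SELECT event_id, items[1] AS first_item FROM events");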
eastcirclek

I am experiencing a similar problem, where Flink Table's type extraction failed to ingest java.util.List or java.util.Map, despite these officially being supported. I found a workaround (read: HACK) I'd like to share.

Step 1: When mapping your data to a POJO, stick with fields you KNOW will be extracted correctly. In my case I had a Map<String, String> that kept ending up as LEGACY('RAW', ANY<java.util.Map>). I joined it into a single String of comma-separated entries, where each entry is 'key:value'.
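
As a sketch, the joining can be as simple as this (joinTags and the tags field are only illustrative names):

import java.util.Map;
import java.util.stream.Collectors;

// flatten the map into a single "k1:v1,k2:v2" string so the POJO field is a plain String
public static String joinTags(Map<String, String> tags) {
  return tags.entrySet().stream()
      .map(e -> e.getKey() + ":" + e.getValue())
      .collect(Collectors.joining(","));
}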

Step 2: For your input data stream, make sure to transform it into DataStream[MY_POJO_TYPE].
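
Roughly like this (rawStream, MyPojo, getEventId and getTags are placeholder names, reusing the joinTags helper above):

// map the incoming records to a POJO whose fields all use well-supported types;
// the tags map has already been squished into a single String here
DataStream<MyPojo> pojoStream = rawStream.map(record ->
    new MyPojo(record.getEventId(), joinTags(record.getTags())));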

Step 3: Go ahead and do Table table = tableEnv.fromDataStream(kafkaInputStream); as usual.

Step 4: Perform another transform on the table with a ScalarFunction. In my case, I wrote a user-defined scalar function that takes the String and outputs a Map<String, String>. Strangely enough, when the Map is produced AFTER the data is in the Table abstraction, Flink is able to resolve it properly into the Flink MAP type.

Here's a rough example of what the user-defined scalar function looks like (in Java):

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.flink.table.functions.ScalarFunction;

public class TagsMapTypeScalarFunction extends ScalarFunction {

  // See
  // https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#scalar-functions
  // for a reference implementation and how interfacing with ScalarFunction works.
  public Map<String, String> eval(String s) {
    // input is comma-delimited key:value pairs
    return Arrays.stream(s.split(","))
        .filter(kv -> !kv.isEmpty())
        .map(kv -> kv.split(":"))
        .filter(pair -> pair.length == 2)
        .filter(pair -> Arrays.stream(pair).allMatch(token -> !token.isEmpty()))
        .collect(Collectors.toMap(pair -> pair[0].trim(), pair -> pair[1].trim()));
  }
}

Here's what the invocation roughly looks like (in Scala):


// This table has a field "tags" which is the comma-delimited, key:value squished string.
val transformedTable = tableEnv.fromDataStream(kafkaInputStream: DataStream[POJO])

tableEnv.createTemporaryFunction(
  "TagsMapTypeScalarFunction",
  classOf[TagsMapTypeScalarFunction]
)

val anotherTransform =
  transformedTable
    .select($"*", call("TagsMapTypeScalarFunction", $"tags").as("replace_tags"))
    .dropColumns($"tags")
    .renameColumns($"replace_tags".as("tags"))

anotherTransform
It certainly is a bit of "busy" work converting from a map to a string and back out to a map again, but it beats being stuck.
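
Once the tags column is a proper MAP, key lookups work as expected; for instance (shown in Java for consistency with the scalar function above; the view name and key are just examples):

tableEnv.createTemporaryView("events", anotherTransform);
// map value access in Flink SQL uses the map['key'] syntax
Table regions = tableEnv.sqlQuery("SELECT tags['region'] AS region FROM events");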

wiw