I am dealing with a stream of database mutations, i.e., a change log stream. I want to be able to transform the values using a SQL query. I am having difficulty putting together the following three concepts: RowTypeInfo, Row, and DataStream.

NOTE: I don't know the schema beforehand. I construct it on the fly using the data within the Mutation object (Mutation is a custom type).

More specifically, I have code that looks like this:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row

val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(execEnv)

// Mutation is a custom type
val mutationStream: DataStream[Mutation] = ...
// toRows returns a collection of org.apache.flink.types.Row objects
val rowStream: DataStream[Row] = mutationStream.flatMap({mutation => toRows(mutation)})
tableEnv.registerDataStream("spinal_tap_table", rowStream)
tableEnv.sql("select col1 + 2 from spinal_tap_table")
```

NOTE: The Row object is positional and doesn't have a placeholder for column names. I couldn't find a place to attach the schema to the DataStream object.

I want to pass some sort of struct similar to Row that contains the complete information {columnName: String, columnValue: Object, columnType: TypeInformation[_]} for the query.

1 Answer

In Flink SQL, a table schema is mandatory when the Table is defined. It is not possible to run queries on dynamically typed records.

Regarding the concepts of RowTypeInfo, Row, and DataStream (a short sketch follows the list):

  • Row is the actual record that holds the data.
  • RowTypeInfo is a schema description for Rows. It contains the name and TypeInformation for each field of a Row.
  • DataStream is a logical stream of records. A DataStream[Row] is a stream of rows. Note that this is not the actual stream of data but an API concept that represents such a stream.
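
To make the relationship concrete, here is a minimal sketch (the column names and types are hypothetical) of a Row and the RowTypeInfo that describes it:

```scala
import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.types.Row

// A Row stores its values purely by position; it carries no schema itself.
val row = new Row(2)
row.setField(0, "some-key")
row.setField(1, 42)

// The RowTypeInfo supplies the name and TypeInformation for each position.
val rowTypeInfo = new RowTypeInfo(
  Array[TypeInformation[_]](BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO),
  Array("col1", "col2"))
```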
  • But the code snippet compiles - shouldn't it fail asking for a schema? I guess my question is: where in that line do I attach the RowTypeInfo? Mutation is my custom type, and I can convert it to a row using the toRows function. `val rowStream: DataStream[Row] = mutationStream.flatMap({mutation => toRows(mutation)})` – user758988 Feb 14 '18 at 20:23
  • Oh I see. The compiler can only check the static types but not look into the fields of Row. That's a difference between Row and Tuple, which has its field types defined by generic types. You can attach a `RowTypeInfo` to any operator with the `returns()` method: `in.map(...).returns(Types.ROW(Types.STRING, Types.INT))`. – Fabian Hueske Feb 14 '18 at 20:33
  • Thanks for the reply! The map and flatMap methods (operators?) both return a DataStream object that doesn't have a `returns` method. I noticed that the `StreamTransformation` interface has the `returns` method. How do I extend and pass it instead of `flatMap`? Should I even be doing that? – user758988 Feb 14 '18 at 21:47
  • Oh, sorry. `returns()` is only available in the Java DataStream API, but you are using Scala. In Scala you can pass the `TypeInformation` as an implicit value, i.e., `implicit val rowType: TypeInformation[Row] = Types.ROW(...)` – Fabian Hueske Feb 14 '18 at 22:15
  • I see. The Scala version of flatMap returns a DataStream without the `returns()` method, but the Java one returns `SingleOutputStreamOperator`. That seems to make the compiler happy: `new DataStream[Row](mutationStream.javaStream.flatMap(mutationToRows).returns(rowTypeInfo))` (see the sketch after this thread). – user758988 Feb 14 '18 at 22:26
  • ^^Just saw your comment! Thanks Fabian – user758988 Feb 14 '18 at 22:39
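
Putting the thread together, here is a sketch of the Scala-implicit approach Fabian describes, with a stand-in Mutation case class and toRows function in place of the asker's actual types (targeting the Flink 1.4-era Table API used in the question):

```scala
import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.types.Row

// Stand-ins for the asker's custom type and conversion function.
case class Mutation(key: String, value: Int)
def toRows(m: Mutation): Seq[Row] = {
  val row = new Row(2)
  row.setField(0, m.key)
  row.setField(1, m.value)
  Seq(row)
}

val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(execEnv)

// Build the schema at runtime; here it is fixed, but it could equally be
// derived from the data inside the Mutation, as the question requires.
val rowTypeInfo = new RowTypeInfo(
  Array[TypeInformation[_]](BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO),
  Array("col1", "col2"))

// The Scala flatMap resolves the result TypeInformation implicitly, so
// bringing the RowTypeInfo into implicit scope attaches the schema.
implicit val rowType: TypeInformation[Row] = rowTypeInfo

val mutationStream: DataStream[Mutation] =
  execEnv.fromElements(Mutation("a", 1), Mutation("b", 2))
val rowStream: DataStream[Row] = mutationStream.flatMap(m => toRows(m))

tableEnv.registerDataStream("spinal_tap_table", rowStream)
val result = tableEnv.sql("SELECT col1, col2 + 2 FROM spinal_tap_table")
```

Registered this way, the query can reference col1 and col2 by name even though Row itself is purely positional; the javaStream workaround in the last comment achieves the same thing through the Java `returns()` method.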