I need to read data from an Apache Kudu database with Apache Flink in real time.
My use case: I receive a message from Kafka, deserialize it, and extract an ID. If the ID already exists in the database, I ignore the message. If it doesn't, I need to insert it.
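To make the intended semantics concrete, here is a minimal plain-Java sketch (not Flink or Kudu code). The class `DedupSketch` and the in-memory set standing in for the Kudu table are hypothetical and only illustrate the behavior I expect: the first message with a given ID is inserted, and any later message with the same ID is ignored.

```java
import java.util.HashSet;
import java.util.Set;

// Plain-Java illustration of the intended semantics; not Flink or Kudu code.
class DedupSketch {
    // Hypothetical stand-in for the IDs already stored in the Kudu table.
    private final Set<String> storedIds = new HashSet<>();

    // Returns true if the ID was new and has been "inserted",
    // false if it already existed and the message is ignored.
    boolean process(String id) {
        // Set.add returns false when the element is already present.
        return storedIds.add(id);
    }

    public static void main(String[] args) {
        DedupSketch sketch = new DedupSketch();
        System.out.println(sketch.process("id-1")); // first sighting: inserted
        System.out.println(sketch.process("id-1")); // same ID again: ignored
    }
}
```

The important part is that the second call sees the effect of the first one, which is exactly what I'm not getting from the Table API.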
I'm loading the data from Kudu with:
tableEnvironment.sqlQuery("");
This works fine for records that were already in the database when the job started and were loaded through the Table API. But if I keep Flink running and send a message with a new ID twice, the Table API doesn't reflect the ID inserted by the first message.
I've tried using the Table API with a changelog stream, but the table still isn't updated. I've tried a watermark interval and a checkpoint interval of 10 seconds, but I realized that my Table object only keeps records during that interval. I've tried running sqlQuery again after inserting a new record, but it had no effect. I've also tried appending the new record to the Table object directly, but I couldn't find a way to do that.
I would like Flink to read data from Kudu continuously, either every 15 seconds or whenever there is new data in Kudu.
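As a rough illustration of the "every 15 seconds" behavior I have in mind, here is a plain-Java sketch (again, not Flink code). `RefreshingIdCache` and its `loader` are hypothetical names; in a real job the loader would be whatever re-reads the IDs from Kudu. The sketch only shows the idea of replacing a one-time snapshot with one that is reloaded on a fixed schedule.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: keeps a snapshot of known IDs and reloads it periodically.
class RefreshingIdCache implements AutoCloseable {
    // Assumption: in a real job this supplier would query Kudu for the current IDs.
    private final Supplier<Set<String>> loader;
    // Latest immutable snapshot; volatile so readers see each refresh.
    private volatile Set<String> snapshot = Collections.emptySet();
    // Daemon thread so the scheduler does not keep the JVM alive.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "id-cache-refresh");
            t.setDaemon(true);
            return t;
        });

    RefreshingIdCache(Supplier<Set<String>> loader, long periodSeconds) {
        this.loader = loader;
        refresh(); // initial load, like the one-time sqlQuery
        scheduler.scheduleAtFixedRate(
            this::refresh, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    // Replaces the snapshot with a fresh copy of whatever the loader returns.
    void refresh() {
        snapshot = Collections.unmodifiableSet(new HashSet<>(loader.get()));
    }

    // Checks the most recent snapshot; may be stale until the next refresh.
    boolean contains(String id) {
        return snapshot.contains(id);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With a period of 15 seconds, an ID inserted after the initial load would become visible at the next refresh at the latest, instead of never.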
My code follows:
// env is the job's StreamExecutionEnvironment, created earlier (not shown).
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().build();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);

// Watermark and checkpoint intervals I experimented with.
tEnv.getConfig().addConfiguration(
    new Configuration()
        .set(PipelineOptions.AUTO_WATERMARK_INTERVAL, Duration.ofSeconds(10))
        .set(ExecutionCheckpointingOptions.CHECKPOINTING_INTERVAL, Duration.ofSeconds(20))
);

DataStream<GenericRecord> registerStream = ...

Schema schemaRegisterStream = Schema.newBuilder()
    .column("ID", DataTypes.STRING())
    .column("col2", DataTypes.STRING())
    .column("col3", DataTypes.STRING())
    .column("col4", DataTypes.STRING())
    .column("col5", DataTypes.STRING())
    .column("col6", DataTypes.STRING())
    .build();

Table streamRegisterTable = tEnv.fromDataStream(registerStream, schemaRegisterStream);
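
// Records currently in Kudu, loaded through the Kudu catalog.
Table kuduRegisters = tEnv.sqlQuery("SELECT * FROM `catalog`.`database.table`");

// Keep only stream records whose ID has no match in Kudu (anti-join via left outer join + IS NULL).
// Assumes the ID columns were renamed to stream_ID and kudu_ID beforehand (not shown),
// since the two join inputs need distinct field names.
Table resultRegisterStreamJoinKudu = streamRegisterTable
    .leftOuterJoin(kuduRegisters, Expressions.$("kudu_ID").isEqual(Expressions.$("stream_ID")))
    .where(Expressions.$("kudu_ID").isNull())
    .select(Expressions.$("stream_ID"),
        Expressions.$("col2"),
        Expressions.$("col3"),
        Expressions.$("col4"),
        Expressions.$("col5"),
        Expressions.$("col6"))
    .as("ID", "col2", "col3", "col4", "col5", "col6");

// convert resultRegisterStreamJoinKudu to a DataStream
// add sink to the DataStream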