
I have two data sources: an S3 bucket and a Postgres database table. Both sources hold records in the same format, with a unique identifier of type uuid. Some of the records present in the S3 bucket are not part of the Postgres table, and the intent is to find those missing records. The data is bounded, as it is partitioned by day in the S3 bucket.

Reading the S3 source (I believe this operation reads the data in batch mode, since I am not providing the monitorContinuously() argument) -


    final FileSource<GenericRecord> source = FileSource.forRecordStreamFormat(
                                             AvroParquetReaders.forGenericRecord(schema), path).build();
    
    final DataStream<GenericRecord> avroStream = env.fromSource(
                                                 source, WatermarkStrategy.noWatermarks(), "s3-source");
    
    DataStream<Row> s3Stream = avroStream.map(x -> Row.of(x.get("uuid").toString()))
                                      .returns(Types.ROW_NAMED(new String[] {"uuid"}, Types.STRING));
    
    Table s3table = tableEnv.fromDataStream(s3Stream); 
    tableEnv.createTemporaryView("s3table", s3table);

For reading from Postgres, I created a Postgres catalog -

    PostgresCatalog postgresCatalog = (PostgresCatalog) JdbcCatalogUtils.createCatalog(
            catalogName,
            defaultDatabase,
            username,
            pwd,
            baseUrl);
    
    tableEnv.registerCatalog(postgresCatalog.getName(), postgresCatalog);
    tableEnv.useCatalog(postgresCatalog.getName());
    
    Table dbtable = tableEnv.sqlQuery("select cast(uuid as varchar) from `localschema.table`");
    tableEnv.createTemporaryView("dbtable", dbtable);

My intention was to simply perform a left join and find the records missing from the dbtable. Something like this -

    Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
    DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);
    resultStream.print();

However, it seems that the UUID column type is not supported just yet because I get the following exception.

    Caused by: java.lang.UnsupportedOperationException: Doesn't support Postgres type 'uuid' yet
        at org.apache.flink.connector.jdbc.dialect.psql.PostgresTypeMapper.mapping(PostgresTypeMapper.java:171)
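
One workaround I can think of (untested, and the view name below is just a placeholder) is to hide the uuid type from Flink's type mapper entirely, by creating a view on the Postgres side that exposes the identifier as varchar, and then querying that view instead of the base table -

    // Workaround sketch (untested): create a view on the Postgres side that casts
    // the identifier to varchar, so Flink never sees the uuid type, e.g.
    //
    //   CREATE VIEW localschema.table_view AS
    //       SELECT uuid::varchar AS uuid FROM localschema.table;
    //
    // and query the view instead of the base table (assuming the catalog resolves
    // views; otherwise the plain JDBC connector shown further below can point at it).
    Table dbtable = tableEnv.sqlQuery("select uuid from `localschema.table_view`");
    tableEnv.createTemporaryView("dbtable", dbtable);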

As an alternative, I tried to read the database table as follows -

    TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] {
            BasicTypeInfo.of(String.class)
    };
    RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
    JdbcInputFormat jdbcInputFormat = JdbcInputFormat.buildJdbcInputFormat()
                                              .setDrivername("org.postgresql.Driver")
                                              .setDBUrl("jdbc:postgresql://127.0.0.1:5432/localdatabase")
                                              .setQuery("select cast(uuid as varchar) from localschema.table")
                                              .setUsername("postgres")
                                              .setPassword("postgres")
                                              .setRowTypeInfo(rowTypeInfo)
                                              .finish();

    DataStream<Row> dbStream = env.createInput(jdbcInputFormat);

    Table dbtable = tableEnv.fromDataStream(dbStream).as("uuid");
    tableEnv.createTemporaryView("dbtable", dbtable);

Only this time, I get the following exception on performing the left join (as above) -

    Exception in thread "main" org.apache.flink.table.api.TableException: Table sink '*anonymous_datastream_sink$3*' doesn't support consuming update and delete changes which is produced by node Join(joinType=[LeftOuterJoin]

It works if I tweak the resultStream to use a changelog stream -

    Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");

    DataStream<Row> resultStream = tableEnv.toChangelogStream(resultTable);
    resultStream.print();

Sample output -

    +I[9cc38226-bcce-47ce-befc-3576195a0933, null]
    +I[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
    +I[da6f57c8-3ad1-4df5-9636-c6b36df2695f, null]
    +I[2f3845c1-6444-44b6-b1e8-c694eee63403, null]
    -D[9cc38226-bcce-47ce-befc-3576195a0933, null]
    -D[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]

However, I do not want the sink to receive the inserts and deletes separately; I just want the final list of missing uuids. I guess this happens because my Postgres source (created with DataStream<Row> dbStream = env.createInput(jdbcInputFormat);) is a streaming source. If I try to execute the whole application in BATCH mode, I get the following exception -

    org.apache.flink.table.api.ValidationException: Querying an unbounded table '*anonymous_datastream_source$2*' in batch mode is not allowed. The table source is unbounded.

Is it possible to have a bounded JDBC source? If not, how can I achieve this using the streaming API? (I am using Flink version 1.15.2.)
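
For context, what I am hoping to end up with is roughly the following; a sketch (untested, names are placeholders) that declares the Postgres side through the JDBC connector DDL, reads the varchar-casting view from the workaround above, and runs everything in batch mode so the left join stays insert-only -

    // Declare the Postgres table through the JDBC connector instead of the catalog.
    // It reads the view that already casts uuid to varchar, so the column can be
    // declared as STRING here.
    tableEnv.executeSql(
            "CREATE TABLE dbtable ("
                    + "  uuid STRING"
                    + ") WITH ("
                    + "  'connector' = 'jdbc',"
                    + "  'url' = 'jdbc:postgresql://127.0.0.1:5432/localdatabase',"
                    + "  'table-name' = 'localschema.table_view',"
                    + "  'username' = 'postgres',"
                    + "  'password' = 'postgres'"
                    + ")");

    // With both sources bounded and the job in batch execution mode, the anti-join
    // should produce insert-only results, so toDataStream() can be used directly.
    Table resultTable = tableEnv.sqlQuery(
            "SELECT s3table.uuid FROM s3table "
                    + "LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid "
                    + "WHERE dbtable.uuid IS NULL");
    tableEnv.toDataStream(resultTable).print();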

I believe this kind of case would be a common use case that can be implemented with Flink, but clearly I'm missing something. Any leads would be appreciated.

1 Answer


For now, the common approach would be to sink the resultStream into a table. You could then schedule a job that truncates the table, runs the Apache Flink job, and afterwards reads the results from this table.
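
A rough sketch of what I mean, assuming a JDBC sink table (connector options and names are just an example, not tested against your setup) -

    // Result table for the missing uuids; with a primary key the JDBC connector
    // works as an upsert sink, so it can also consume the update/delete changes
    // produced by the streaming left join.
    tableEnv.executeSql(
            "CREATE TABLE missing_uuids ("
                    + "  uuid STRING,"
                    + "  PRIMARY KEY (uuid) NOT ENFORCED"
                    + ") WITH ("
                    + "  'connector' = 'jdbc',"
                    + "  'url' = 'jdbc:postgresql://127.0.0.1:5432/localdatabase',"
                    + "  'table-name' = 'localschema.missing_uuids',"
                    + "  'username' = 'postgres',"
                    + "  'password' = 'postgres'"
                    + ")");

    // Assuming resultTable is projected down to the single S3-side uuid column.
    resultTable.executeInsert("missing_uuids");

Reading the final list is then a plain SELECT on that table after the job has finished.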

I also noticed that Apache Flink Table Store 0.3.0 was just released, and materialized views are on the roadmap for 0.4.0. This might be a solution too. Very exciting, imho.
