
I am reading from two streams: one with records and one with metadata.

The first time the application runs, I want it to build the metadata by scanning the complete table and saving it to Flink's MapState. Updates to the table will be captured via the metadata stream, and the MapState will be updated accordingly.

From the second run onward, I want to use the MapState instead of reading the entire table.

Below is my implementation of this functionality, but my MapState is always empty. Am I doing something wrong here?

public class CustomCoFlatMap extends RichCoFlatMapFunction<Record, Metadata, Output> {

    private transient DataSource datasource;
    private transient MapState<String, Metadata> metadataState;

    @Inject
    public void setDataSource(DataSource datasource) {
        this.datasource = datasource;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        final RichFunctionComponent component = DaggerRichFunctionComponent.builder()
                .richFunctionModule(RichFunctionModule.builder()
                        .runtimeContext(getRuntimeContext())
                        .build())
                .build();
        component.inject(this);

        // read MapState from snapshot
        metadataState = getRuntimeContext().getMapState(new MapStateDescriptor<String, Metadata>("metadataState",
                TypeInformation.of(new TypeHint<String>(){}), TypeInformation.of(new TypeHint<Metadata>() {})));
    }

    @Override
    public void flatMap2(Metadata metadata, Collector<Output> collector) throws Exception {
        // this should happen only when application starts for first time
        // from next time, application will read from snapshot
        readMetadataForFirstTime();

        // update metadata in MapState
        this.metadataState.put(metadata.getId(), metadata);
    }

    @Override
    public void flatMap1(Record record, Collector<Output> collector) throws Exception {
        
        readMetadataForFirstTime();
        
        Metadata metadata = this.metadataState.get(record.getId());

        Output output = new Output(record.getId(), metadata.getName(), metadata.getVersion(), metadata.getType());
    
        collector.collect(output);
    }

    private void readMetadataForFirstTime() throws Exception {
        if(this.metadataState.iterator().hasNext()) {
            // metadataState from snapshot has data
            // not reading from table
            return;
        }

        // do this only once
        // read metadata from table and add it to MapState
        List<Metadata> metadataList = datasource.listAllMetadata();
        for(Metadata metadata: metadataList) {
            this.metadataState.put(metadata.getId(), metadata);
        }

    }
}

EDIT: Rest of the application

DataStream<Metadata> metadataKeyedStream =
                env.addSource(metadataStream)
                        .keyBy(Metadata::getId);

SingleOutputStreamOperator<Output> outputStream =
                env.addSource(recordStream)
                        .assignTimestampsAndWatermarks(new RecordTimeExtractor())
                        .keyBy(Record::getId)
                        .connect(metadataKeyedStream)
                        .flatMap(new CustomCoFlatMap());
Logic

1 Answer

MapState is a kind of key-partitioned state -- meaning that Flink is maintaining a separate Map<String, Metadata> for every distinct key in the input stream. readMetadataForFirstTime will have to read and insert its data for every key in the stream being processed by your RichCoFlatMapFunction, since there is a separate map for every key.

You may want to approach this differently, depending on exactly what you are trying to do. For example, if you just want to store one value for every key in the source stream, then you should use ValueState rather than MapState. You can think of ValueState as a sharded key/value store, where each parallel instance of your stateful operator (e.g., a RichCoFlatMapFunction) will have the values for a slice of the key space. MapState is for the case where you need to store an entire hashmap for every key, rather than a single object.
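For example, here is a minimal sketch of a ValueState-based version (reusing the Record, Metadata, Output, and DataSource types from the question; the per-key datasource.getMetadataById(...) fallback is a hypothetical lookup, not something from the original code, and the Dagger wiring for the datasource is omitted):

public class CustomCoFlatMap extends RichCoFlatMapFunction<Record, Metadata, Output> {

    private transient DataSource datasource; // injection omitted for brevity
    // one Metadata value per key; Flink keeps a separate value for each id
    private transient ValueState<Metadata> metadataState;

    @Override
    public void open(Configuration parameters) throws Exception {
        metadataState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("metadata", Metadata.class));
    }

    @Override
    public void flatMap2(Metadata metadata, Collector<Output> collector) throws Exception {
        // updates from the metadata stream overwrite the state for the current key
        metadataState.update(metadata);
    }

    @Override
    public void flatMap1(Record record, Collector<Output> collector) throws Exception {
        Metadata metadata = metadataState.value();
        if (metadata == null) {
            // only hit the table for keys that have never been seen before
            metadata = datasource.getMetadataById(record.getId()); // hypothetical per-key lookup
            metadataState.update(metadata);
        }
        collector.collect(new Output(record.getId(), metadata.getName(),
                metadata.getVersion(), metadata.getType()));
    }
}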

(If I've misjudged where the problem lies, please share more context showing how the rest of the application is using this RichCoFlatMapFunction.)

David Anderson
  • Makes sense to have a `ValueState` instead of `MapState`, as I only have one value for each key. But how can I write a `Metadata` to value state for each Id? Should I save with the `ValueStateDescriptor` name as `Id` and value as `Metadata`? If yes, then should I do `getRuntimeContext().getState(new ValueStateDescriptor<>(metadata.getId(), TypeInformation.of(new TypeHint() {})));` for each record received in `flatMap1` and `flatMap2`? – Logic Nov 21 '20 at 17:47
  • I suggest you look at the Flink training materials in https://ci.apache.org/projects/flink/flink-docs-stable/learn-flink/etl.html#stateful-transformations and https://github.com/apache/flink-training/tree/release-1.11/rides-and-fares, which explain how things work in some detail and also include some examples. – David Anderson Nov 21 '20 at 20:08
  • The state descriptor name should be a constant string; something like "metadata" would be fine. And if Metadata is a POJO, then you could simply do something like `new ValueStateDescriptor<>("metadata", Metadata.class)`. – David Anderson Nov 21 '20 at 20:11
  • Got it. I wanted to know how I will put all `Metadata` into `ValueState`, i.e., when I call `readMetadataForFirstTime()`, I want to add the list of `Metadata` into `ValueState`. But I only see `update` on `ValueState`; how do I add the whole `Metadata` list into `ValueState`? – Logic Nov 22 '20 at 06:01
  • In other words, I want to read all entries in the `Metadata` table and save them to state only when the application starts for the first time. When the application restarts, I want to retrieve them from state instead of reading the table again. – Logic Nov 22 '20 at 07:23
  • In `readMetadataForFirstTime`, use `metadataState.value()` to check if the value for this key is null, and if it is, get the value for this key from the datasource. A better approach might be to bootstrap the state for all keys using the State Processor API. https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html – David Anderson Nov 22 '20 at 18:17
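Following up on the last comment, here is a rough sketch of bootstrapping that keyed state with the State Processor API, based on the Flink 1.11-era DataSet-style API that the linked docs describe; the operator uid "metadata-join", the savepoint path, the max parallelism of 128, and the MetadataBootstrapper class are illustrative assumptions, not part of the original post:

ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Metadata> metadataSet = batchEnv.fromCollection(datasource.listAllMetadata());

BootstrapTransformation<Metadata> transformation = OperatorTransformation
        .bootstrapWith(metadataSet)
        .keyBy(Metadata::getId)
        .transform(new MetadataBootstrapper());

// the uid and max parallelism must match the CoFlatMap operator in the streaming job
Savepoint.create(new MemoryStateBackend(), 128)
        .withOperator("metadata-join", transformation)
        .write("file:///path/to/bootstrap-savepoint");
batchEnv.execute("bootstrap metadata state");

public class MetadataBootstrapper extends KeyedStateBootstrapFunction<String, Metadata> {

    private transient ValueState<Metadata> metadataState;

    @Override
    public void open(Configuration parameters) {
        // same descriptor as in the streaming job, so the state can be restored there
        metadataState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("metadata", Metadata.class));
    }

    @Override
    public void processElement(Metadata metadata, Context ctx) throws Exception {
        metadataState.update(metadata);
    }
}

The streaming job is then started from the written savepoint (flink run -s /path/to/bootstrap-savepoint ...), so every key already has its Metadata in state before the first Record arrives.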