
I am trying to migrate from the Flink 1.12.x DataSet API to the Flink 1.14.x DataStream API. mapPartition is not available in the DataStream API.

Our code using the Flink 1.12.x DataSet API:

dataset
    .<few operations>
    .mapPartition(new SomeMapPartitionFn())
    .<few more operations>

public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {

    @Override
    public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
        for (InputModel record : records) {
            /*
            do some operation    
             */
            if (/* some condition based on processing *MULTIPLE* records */) {
                out.collect(...); // Conditional collect                ---> (1)
            }
        }
        
        // At the end of the data, collect
        out.collect(...);   // Collect processed data                   ---> (2) 
    }
}
  • (1) - Collector.collect is invoked based on some condition after processing a few records

  • (2) - Collector.collect is invoked at the end of the data

Initially we thought of using flatMap instead of mapPartition, but the Collector is not available in the close() method.

https://issues.apache.org/jira/browse/FLINK-14709 - only available in the case of chained drivers

How can this be implemented with the Flink 1.14.x DataStream API? Please advise.

Note: Our application works with only a finite set of data (batch mode).

Saravanan

2 Answers


In Flink's DataSet API, a MapPartitionFunction has two parameters: an iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.

In order to implement a similar function, you need to define a window over the stream. Windows discretize the stream, which is somewhat similar to mini-batches, but windows offer much more flexibility.
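
A minimal sketch of this window-based idea, as one possibility (the input variable, the window size, and the reuse of InputModel/OutputModel from the question are assumptions, not part of this answer): records are grouped into fixed-size count windows and the function sees all elements of a window at once, which is similar in shape to mapPartition. Note that a trailing, incomplete window will not fire with the default count trigger.

// Relevant classes: org.apache.flink.streaming.api.datastream.DataStream,
// org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction,
// org.apache.flink.streaming.api.windowing.windows.GlobalWindow,
// org.apache.flink.util.Collector
DataStream<OutputModel> result = input
        .countWindowAll(1000)  // assumed batch size, pick one that fits your data
        .process(new ProcessAllWindowFunction<InputModel, OutputModel, GlobalWindow>() {
            @Override
            public void process(Context ctx,
                                Iterable<InputModel> records,
                                Collector<OutputModel> out) {
                for (InputModel record : records) {
                    // per-record logic with a conditional out.collect(...)       ---> (1)
                }
                // out.collect(...) once per window instead of once per partition ---> (2)
            }
        });

The main difference to mapPartition is that the batches are defined by the window size rather than by the parallel instance, so the "end of data" collect happens once per window.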

ChangLi
  • Thanks for the quick response. I read this thread (https://stackoverflow.com/questions/33401332/apache-flink-datastream-api-doesnt-have-a-mappartition-transformation) already. The main problem is the second Collector.collect that has to be called at the end of the data; the Collector is not available in the close method. – Saravanan Feb 09 '22 at 08:50
  • I don't think "Flink's internal stream processing model requires that user functions return in order to checkpoint function state" is valid, because Flink came up with a unified API for both batch & streaming. – Saravanan Feb 09 '22 at 08:55
  • For those interested in this question, it was also posted to the Flink User mailing list, so there could be answers there too: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z – Martijn Visser Feb 10 '22 at 07:55

Solution provided by Zhipeng

One solution could be to use a stream operator that implements the BoundedOneInput interface. Example code can be found here [1]; a minimal sketch of the same idea is included below.

[1] https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75

Flink user mailing list thread: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z
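
For reference, a minimal sketch of this approach (this is not the flink-ml code from [1]; the class name MyMapPartitionOperator, the helper methods, and the reuse of InputModel/OutputModel from the question are placeholders): the operator buffers records in processElement and performs the end-of-data collect in endInput, which the runtime calls once the bounded input is exhausted.

import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.BoundedOneInput;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

import java.util.ArrayList;
import java.util.List;

public class MyMapPartitionOperator
        extends AbstractStreamOperator<OutputModel>
        implements OneInputStreamOperator<InputModel, OutputModel>, BoundedOneInput {

    // Records buffered by this subtask ("partition"); the flink-ml utility in [1]
    // uses operator state rather than a plain heap list for large inputs.
    private final List<InputModel> buffer = new ArrayList<>();

    @Override
    public void processElement(StreamRecord<InputModel> element) {
        InputModel value = element.getValue();
        buffer.add(value);
        if (someConditionOverMultipleRecords(buffer)) {
            output.collect(new StreamRecord<>(partialResult(buffer)));  // ---> (1)
        }
    }

    @Override
    public void endInput() {
        // Called exactly once after the last record of the bounded input,
        // i.e. the place for the final collect of the old mapPartition.
        output.collect(new StreamRecord<>(finalResult(buffer)));        // ---> (2)
        buffer.clear();
    }

    // Placeholders for the business logic elided in the question.
    private boolean someConditionOverMultipleRecords(List<InputModel> records) { return false; }
    private OutputModel partialResult(List<InputModel> records) { return null; }
    private OutputModel finalResult(List<InputModel> records) { return null; }
}

It can be wired in with something like stream.transform("mapPartition", TypeInformation.of(OutputModel.class), new MyMapPartitionOperator()); in batch execution mode each parallel subtask keeps its own buffer and gets its own endInput call.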

Saravanan