I was wondering whether Flink's DataStream API can be used to remove duplicates from incoming records (perhaps over a particular time window), just like the DataSet API, which provides a transformation called "distinct". Alternatively, can a DataSet be transformed into a DataStream, given that a DataSet is converted to a DataStream for internal processing in Flink?

Please help me with this. Thanks in advance! Cheers!

Anish Sarangi

1 Answer

I'm not aware of any built-in primitive, but if all data within the window fits into memory, then you can easily build this function yourself.

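DataStream<...> stream = ...
stream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(new DistinctFunction<>());

import java.util.HashSet;
import java.util.Set;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.util.Collector;

// Emits each distinct element of a window once; T must implement equals()/hashCode() consistently.
public class DistinctFunction<T, W extends Window> extends ProcessAllWindowFunction<T, T, W> {
    @Override
    public void process(final Context context, Iterable<T> input, Collector<T> out) throws Exception {
        // Buffer the whole window in a set, which drops duplicates.
        Set<T> elements = new HashSet<>();
        input.forEach(elements::add);
        elements.forEach(out::collect);
    }
}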

Of course, it's much more scalable if you have a key (keyBy followed by window instead of windowAll, with a ProcessWindowFunction), as then only the data of one key in the window needs to be held in memory.
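
To make the keyed idea concrete, here is a minimal plain-Java sketch (no Flink involved) of what "distinct per key and per tumbling window" computes. The Event type, its fields, and the distinctPerKeyAndWindow method are hypothetical names made up for this illustration. Unlike a Flink window function, it emits an element the first time it is seen rather than when the window fires, and it never discards old window state, so treat it as an illustration of the semantics only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KeyedWindowDistinct {

    /** Hypothetical record type: a key, a payload, and an event timestamp in milliseconds. */
    public static final class Event {
        public final String key;
        public final String value;
        public final long timestampMs;

        public Event(String key, String value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /**
     * Keeps the first occurrence of each value per key and per tumbling window
     * of the given size; later duplicates in the same key and window are dropped.
     */
    public static List<Event> distinctPerKeyAndWindow(List<Event> events, long windowSizeMs) {
        // key -> window start -> values already seen in that window
        Map<String, Map<Long, Set<String>>> seen = new HashMap<>();
        List<Event> out = new ArrayList<>();
        for (Event e : events) {
            // Start of the tumbling window that contains this timestamp.
            long windowStart = e.timestampMs - Math.floorMod(e.timestampMs, windowSizeMs);
            Set<String> values = seen
                .computeIfAbsent(e.key, k -> new HashMap<>())
                .computeIfAbsent(windowStart, w -> new HashSet<>());
            // Set.add returns false if the value was already present.
            if (values.add(e.value)) {
                out.add(e);
            }
        }
        return out;
    }
}
```

Each set here only ever holds the values of a single key in a single window, which is the scalability point above: the per-key sets stay small and, in a keyed Flink job, can be spread across parallel tasks.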

Arvid Heise