
I have checkpointing set up in my Flink job, which has 2 sliding windows (these aren't joins) and 1 tumbling window join. The idea is that I don't really need to save the state for the join itself, since saving the state for the 2 sliding windows is enough. The join ends up holding 20-30 GB of state, which causes the job to lag and crash, and the checkpoint never finishes saving.

How can I accomplish this?

I am trying something like:

public class CustomJoin implements JoinFunction<A, A, A>, ListCheckpointed<A> {

    @Override
    public A join(A a, A b) {
        // Some irrelevant join logic
    }

    // Return an empty list so that nothing is written for this function
    @Override
    public List<A> snapshotState(long l, long l1) throws Exception {
        return new ArrayList<>();
    }

    // Nothing to restore
    @Override
    public void restoreState(List<A> list) throws Exception {

    }
}

Does this actually avoid storing state for the join? It's called like this:

stream
    .assignTimestampsAndWatermarks(...)
    .join(secondStream.assignTimestampsAndWatermarks(...))
    .where(KeySelector...)
    .equalTo(KeySelector...)
    .window(TumblingEventTimeWindows.of(Time.minutes(1L)))
    .trigger(EventTimeTrigger.create())
    .apply(new CustomJoin());

Does this work in practice? What's the best way to avoid storing state?

2 Answers


As I understand Flink, a checkpoint has to guarantee that the entire computation can be recovered safely and consistently, so snapshotting the state globally is unavoidable. You can turn Flink's checkpointing off, but I don't recommend it: it is based on the ABS (Asynchronous Barrier Snapshotting) algorithm, which costs little in performance. You could instead use Flink's savepoints to take snapshots yourself, but keep in mind that checkpoints can be incremental (with the RocksDB state backend), whereas a savepoint is a full snapshot.

I would suggest that you look at these materials:

1. Distributed Snapshots: Determining Global States of Distributed Systems
2. Lightweight Asynchronous Snapshots for Distributed Dataflows
3. https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/state/checkpointing.html

I think these can help solve your problem.


In a windowed join, the JoinFunction is executed by the window operator. It doesn't have its own state. So what you're trying isn't going to help.

Moreover, sliding windows use a lot more state than you may realize. Each overlapping instance has its own copy of the window's contents. So, for example, if you have hour-long windows that slide by 1 minute, then each event is copied 60 times.
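To make that arithmetic concrete, here is a minimal sketch in plain Java (no Flink dependency). The class and method names are hypothetical, and the loop only mimics how a sliding window assigner could enumerate the overlapping windows for one event; it is not Flink's actual implementation and it assumes a window offset of zero.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowCopies {

    // Hypothetical helper: returns the start timestamp of every sliding
    // window (of length `size`, advancing by `slide`) that contains `timestamp`.
    public static List<Long> windowStartsFor(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Start of the most recent window that begins at or before the event
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        // Walk backwards while the window [start, start + size) still covers the event
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long size = 60 * 60 * 1000L;            // 1 hour, in milliseconds
        long slide = 60 * 1000L;                // 1 minute, in milliseconds
        long timestamp = 90 * 60 * 1000L + 123; // an arbitrary event time

        // Each element of the result is one window holding a copy of the event
        System.out.println(windowStartsFor(timestamp, size, slide).size()); // prints 60
    }
}
```

Passing the same value for `size` and `slide` models a tumbling window, where the event lands in exactly one window.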

David Anderson