
In addition to this question, I'm still not clear on why the checkpoints of my Flink job grow and grow over time; at the moment, after about 7 days of running, the checkpoint size has never plateaued. I'm using Flink 1.10 with the FS State Backend, as my job cannot afford the latency cost of using RocksDB.

See how the checkpoints evolve over 7 days: [screenshot: checkpoint size growing steadily over the 7 days]

Let's say I have this TTL configuration for the state in all my stateful operators, with a TTL of one hour (or more in some operators, and one day in one case):

public static final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(1))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .cleanupFullSnapshot().build();
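
Side note on the cleanup strategy: with the heap-based FS State Backend, cleanupFullSnapshot() drops expired entries only at the moment a full snapshot is taken. Flink 1.10 also supports incremental cleanup on state access for heap backends; a minimal sketch of that variant follows (the batch size of 10 and the per-record flag are illustrative values, not taken from the actual job):

public static final StateTtlConfig incrementalTtlConfig = StateTtlConfig.newBuilder(Time.hours(1))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            // check up to 10 entries for expiration on every state access
            .cleanupIncrementally(10, false)
            .build();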

My understanding is that all the entries in these states should be cleaned up once their TTL expires, and therefore the checkpoint size should shrink again, since we expect more or less the same amount of data every day.

On the other hand, we have a daily traffic curve: some hours of the day bring more incoming data, but late at night the traffic drops, so all the expired state entries should be cleaned up and the checkpoint size should go down rather than stay the same until the traffic picks up again.

Let's look at a code sample from one of the use cases:

DataStream<Event> stream = addSource(source);

KeyedStream<Event, String> keyedStream = stream
        .filter((FilterFunction<Event>) event -> /* apply filters here */ true)
        .name("Events filtered")
        .keyBy(k -> k.rType.equals("something") ? k.id1 : k.id2);

keyedStream.flatMap(new MyFlatMapFunction());


public class MyFlatMapFunction extends RichFlatMapFunction<Event, Event> {

    private final MapStateDescriptor<String, Event> descriptor =
            new MapStateDescriptor<>("prev_state", String.class, Event.class);
    private MapState<String, Event> previousState;

    @Override
    public void open(Configuration parameters) {
        /* ttlConfig described above */
        descriptor.enableTimeToLive(ttlConfig);
        previousState = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void flatMap(Event event, Collector<Event> collector) throws Exception {
        final String key = event.rType.equals("something") ? event.id1 : event.id2;
        Event previous = previousState.get(key);
        if (previous != null) {
            /* something done here */
        } else {
            /* something done here */
            previousState.put(key, previous);
        }
        collector.collect(previous);
    }
}

This is more or less the structure of the use cases; some others use windows (time windows or session windows).
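
For context, the windowed use cases look roughly like this (the window assigners, sizes, and the trivial reduce are illustrative placeholders, not the actual job logic):

// Illustrative placeholders only; requires org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows,
// ProcessingTimeSessionWindows, and org.apache.flink.streaming.api.windowing.time.Time
keyedStream
        .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
        .reduce((a, b) -> b) // e.g. keep the latest event per key and window
        .name("Windowed aggregation");

keyedStream
        .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
        .reduce((a, b) -> b)
        .name("Session aggregation");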

Questions:

  • What am I doing wrong here?
  • Are the states cleaned up when they expire, in this scenario and in the rest of the use cases, which are similar?
  • What can help me fix the checkpoint size if something here is working incorrectly?
  • Is this behaviour normal?

Kind regards!

Alter
  • Is the JVM heap growing in a similar fashion as well, or just the checkpoint sizes? – David Anderson Sep 02 '20 at 19:23
  • No, the JVM heap is fine, which means the GC is working as expected, right? Thanks. – Alter Sep 02 '20 at 20:16
  • I'm unable to formulate a theory of your application that explains all of the facts you've shared. Something doesn't add up. Given that I don't understand the situation well enough, I hesitate to offer any advice. – David Anderson Sep 04 '20 at 17:24
  • There's an experiment you could do that might shed some light on what's going on with the checkpoints growing in size. If you restore (a copy of) the job from a checkpoint and disable the input(s), then after an hour the checkpoint size should drop to zero. – David Anderson Sep 04 '20 at 17:28
  • Allow me to ask you another question: my job consists of reading events from RabbitMQ -> transformations -> sink. I'm not using restart strategies, because if my job fails the SVC service manages the automatic restart and the job starts from scratch. Given this mode of operation, checkpoints make no sense to me, since they are not used for any restart after a failure. As I understand it, they are worthwhile when a Flink application crashes and then restarts from a checkpoint, but as my job always starts from scratch: do I need checkpoints in my scenario? – Alter Sep 04 '20 at 17:55
  • Also, I can't restore my job from any checkpoint, because I'm running it as a single Java application: `java -jar jobName.jar`. – Alter Sep 04 '20 at 18:00
  • Even when running as a single Java application, your jobs can recover from checkpoints so long as the failures don't bring down Flink itself. If the Job Manager and Task Manager are still running, the job will restart from the latest checkpoint unless you have disabled checkpointing, or have set a noRestart restart strategy. – David Anderson Sep 04 '20 at 19:43
  • By decision of the company I have set a noRestart restart strategy, so I think I don't need checkpoints in my scenario, right? Any single failure will shut Flink down with `System.exit(1);`, as a company decision at the moment. – Alter Sep 04 '20 at 19:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/220991/discussion-between-david-anderson-and-alejandro-deulofeu). – David Anderson Sep 04 '20 at 19:48
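
Following up on the restart-strategy discussion in the comments above: a minimal sketch of how checkpointing plus a fixed-delay restart strategy could be wired into the job, so that a failure that leaves the Job Manager and Task Manager running recovers from the latest checkpoint (the interval and retry values are illustrative assumptions):

// requires org.apache.flink.api.common.restartstrategy.RestartStrategies
// and org.apache.flink.api.common.time.Time
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // checkpoint every 60 seconds (illustrative)
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10))); // instead of noRestart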

1 Answer


In this stretch of code it appears that you are simply writing back the state that was already there, which only serves to reset the TTL timer. This might explain why the state isn't being expired.

Event previous = previousState.get(key);
if (previous != null) {
  /*something done here*/
} else
  previousState.put(key, previous);

It also appears that you should be using ValueState rather than MapState. ValueState effectively provides a sharded key/value store, where the keys are the keys used to partition the stream in the keyBy. MapState gives you a nested map for each key, rather than a single value. But since you are using the same key inside the flatMap that you used to key the stream originally, key-partitioned ValueState would appear to be all that you need.
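
For illustration, here is a sketch of what MyFlatMapFunction might look like with key-partitioned ValueState, writing the state only when there is actually something new to store so that the TTL timer is not reset on every record. The update and emit logic is a guess (store the incoming event and emit it), since the original /*something done here*/ parts are not shown:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class MyFlatMapFunction extends RichFlatMapFunction<Event, Event> {

    private transient ValueState<Event> previousState;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Event> descriptor =
                new ValueStateDescriptor<>("prev_state", Event.class);
        descriptor.enableTimeToLive(ttlConfig); // same ttlConfig as in the question
        previousState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Event event, Collector<Event> collector) throws Exception {
        // The stream is already keyed by id1/id2, so no explicit map key is needed here.
        Event previous = previousState.value();
        if (previous != null) {
            // handle the case where a previous event exists (placeholder)
        } else {
            // store the new event; only writing when something changes avoids
            // resetting the TTL timer on every record
            previousState.update(event);
        }
        collector.collect(event);
    }
}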

David Anderson
  • Thanks a lot David. This answer clarifies everything for me. Kind regards – Alter Sep 02 '20 at 20:15
  • Allow me to ask one more thing related to this: is it possible that this same issue with the values that never expire is also the root cause of the CPU load increasing over time, just as the checkpoints do, because it takes more CPU effort to complete the checkpoints? Thanks one more time. – Alter Sep 02 '20 at 20:58
  • In general the state backend may be spending time trying to expire things that aren't expiring, not just during checkpointing, but continuously. But I would expect to see the heap growing, if the keyspace is growing. – David Anderson Sep 03 '20 at 07:49
  • @DavidAnderson I am using Flink 1.10 with RocksDB as the state backend and I have observed a similar graph of the checkpoint size gradually increasing. After running for a few days (depending on the total Flink TM memory given), the job restarts with heap usage considerably increased. Is this normal behaviour? – Debu Sep 14 '21 at 12:30
  • With RocksDB it's normal for the checkpoint sizes to gradually increase up until RocksDB triggers compaction. Depending on your data volumes and update frequency, this could take a long time. As for having a large heap, I wouldn't expect that with RocksDB, but it is certainly possible -- it depends on what your job is doing, and how it's configured. – David Anderson Sep 14 '21 at 16:56
  • @DavidAnderson it's simply an aggregation job (using a time window) where we read events from Kafka, aggregate the data, and push it to a sink, but as I told you, the job keeps restarting after N days. So far we have only configured taskmanager.memory.flink.size: 3072m; the rest is all defaults. – Debu Sep 16 '21 at 06:57
  • Also, we are using incremental checkpointing with RocksDB. – Debu Sep 16 '21 at 08:12
  • @Debu On Flink 1.10 it's not unusual for RocksDB to run out of memory -- but not heap. This was partially addressed in 1.11, and should be fixed in 1.14. It's unusual to run out of heap with RocksDB, unless perhaps you are using on-heap timers. – David Anderson Sep 16 '21 at 16:42