I am using the Flink SQL API to process data.

When I restart the app, the sum result is not restored from the checkpoint; it still starts from 1.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StateBackend stateBackend = new FsStateBackend("file:///D:/d_backup/github/flink-best-practice/checkpoint");
env.enableCheckpointing(1000 * 60);
env.setStateBackend(stateBackend);

// tableEnv and the Kafka-backed source table rtc_warning_gmys are registered
// earlier (not shown here).
Table table = tableEnv.sqlQuery(
        "select sum(area_id) " +
        "from rtc_warning_gmys " +
        "where area_id = 1 " +
        "group by character_id,area_id,group_id,platform");

// Convert the Table into a retract DataStream of Row.
// A retract stream of type X is a DataStream<Tuple2<Boolean, X>>.
// The boolean field indicates the type of the change:
// true is INSERT, false is DELETE.
DataStream<Tuple2<Boolean, Row>> dsRow = tableEnv.toRetractStream(table, Row.class);
dsRow.map(new MapFunction<Tuple2<Boolean, Row>, Object>() {
    @Override
    public Object map(Tuple2<Boolean, Row> booleanRowTuple2) throws Exception {
        if (booleanRowTuple2.f0) {
            System.out.println(booleanRowTuple2.f1.toString());
            return booleanRowTuple2.f1;
        }
        return null;
    }
});

env.execute("Kafka table select");

The log output is:

1 2 3 ... ... 100

After restarting the app, it still starts from: 1 2 3 ...

I expected the sum value to be stored in the checkpoint file, so that after a restart the app would continue from the last result, like:

101 102 103 ... 120

Oliv
1 Answer


Some possibilities:

  • Did the job run long enough to complete a checkpoint? Just because the job produced output doesn't mean that a checkpoint was completed. I see you have checkpointing configured to occur once a minute, and the checkpoints could take some time to complete.

  • How was the job stopped? Unless they have been externalized, checkpoints are deleted when a job is cancelled (see the sketch after this list).

  • How was the job restarted? Did it recover (automatically) from a checkpoint, or was it resumed from an externalized checkpoint or savepoint, or was it restarted from scratch?

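If you want a checkpoint to survive a manual cancellation (the second point above), it has to be externalized and retained. A minimal sketch, assuming the Flink 1.x CheckpointConfig API:

env.enableCheckpointing(1000 * 60);
// Keep checkpoint files when the job is cancelled, so the job can later be
// resumed manually with "flink run -s <checkpoint path> ...".
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
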
This sort of experiment is easiest to do via the command line. You might, for example,

  1. write an app that uses checkpoints and has a restart strategy (e.g., env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1000, 1000))); see the sketch after these steps
  2. start a local cluster
  3. "flink run -d app.jar" to start the job
  4. wait until at least one checkpoint has completed
  5. "kill -9 task-manager-PID" to cause a failure
  6. "taskmanager.sh start" to allow the job to resume from the checkpoint
David Anderson
  • `1`. I am sure a checkpoint completed, and the checkpoint files were not cleared (logs: Triggering checkpoint 1 @ 1545993069375 for job f58df49e5a9172aad371d3e393c4be36. Completed checkpoint 1 for job f58df49e5a9172aad371d3e393c4be36 (28443 bytes in 308 ms).); `2`. I run the job in the IDE (IDEA) and click the stop button to stop it; `3`. I restart the job by clicking the start button; `4`. I am not sure how to set the restart point when running the code in the IDE (IDEA). On a cluster it would be: `bin/flink run -s checkpoint path flink-app-jobs.jar` – wxm imperio Dec 28 '18 at 11:40
  • SQL queries are executed as regular applications by Flink. Flink only recovers jobs automatically if the cluster keeps running. If you cancel the job in the IDE, the IDE-embedded cluster is terminated. You have to follow the steps described by David (starting a local cluster, starting a job, canceling a TM, starting a TM) to recover a job. In order to start an application on a new cluster, you have to start from a "savepoint". – Fabian Hueske Dec 28 '18 at 17:08