
When a batch job finishes, what is the ApplicationCluster state supposed to be? Is increasing restartNonce the intended way to re-run the job?

I am trying to use the Flink Kubernetes operator to deploy a Flink batch job and trigger it with a Kubernetes CronJob every day at a certain time.

gix

2 Answers


The operator is designed mostly with streaming jobs in mind but in theory batch jobs should also work.

When a batch job finishes (Flink 1.15 and above) the FlinkDeployment.status.jobStatus.state should go into FINISHED.
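As a rough illustration of checking that state, here is a minimal sketch. It assumes the FlinkDeployment has already been fetched from the Kubernetes API as a plain dictionary (for example, parsed from JSON); the helper names `job_state` and `is_finished` are hypothetical and not part of the operator.

```python
from typing import Any, Mapping, Optional


def job_state(deployment: Mapping[str, Any]) -> Optional[str]:
    """Return status.jobStatus.state, or None if it has not been set yet."""
    # Assumption: `deployment` is the FlinkDeployment resource as a dict.
    status = deployment.get("status") or {}
    job_status = status.get("jobStatus") or {}
    return job_status.get("state")


def is_finished(deployment: Mapping[str, Any]) -> bool:
    """True only when the operator reports the job as FINISHED."""
    return job_state(deployment) == "FINISHED"


if __name__ == "__main__":
    # Hand-written sample resource, not real operator output.
    sample = {"status": {"jobStatus": {"state": "FINISHED"}}}
    print(is_finished(sample))
```

A scheduled trigger could use a check like this to avoid resubmitting while a previous run is still in progress.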

Bumping the restartNonce resubmits the job. If you set the upgradeMode to stateless, the job starts completely from scratch.

So in theory you could cron the bumping of the restartNonce but this is not a pattern we have tested or use in production ourselves.
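To make the idea concrete, the following is a minimal, untested-in-production sketch of the patch body such a cron task could send. It assumes the resource is available as a dictionary and that the fields live at `spec.restartNonce` and `spec.job.upgradeMode`; the helper name `build_restart_patch` is hypothetical.

```python
import json
from typing import Any, Dict, Mapping


def build_restart_patch(deployment: Mapping[str, Any]) -> Dict[str, Any]:
    """Build a JSON merge patch that bumps restartNonce by one."""
    spec = deployment.get("spec") or {}
    # A missing or null restartNonce is treated as 0 (an assumption).
    current = spec.get("restartNonce") or 0
    return {
        "spec": {
            "restartNonce": int(current) + 1,
            # stateless: resubmit from scratch instead of restoring state
            "job": {"upgradeMode": "stateless"},
        }
    }


if __name__ == "__main__":
    # Hand-written sample resource, not real operator output.
    sample = {"spec": {"restartNonce": 3}}
    print(json.dumps(build_restart_patch(sample)))
```

The resulting JSON could then be applied as a merge patch, for instance with `kubectl patch` using `--type merge` against your own resource name.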

Gyula
  • Thank you very much for replying, Gyula. There is indeed a problem when using restartNonce. Since I cannot report issues at https://github.com/apache/flink-kubernetes-operator, I summarize the problem below. – gix Nov 24 '22 at 06:27
  • Application mode failed: it seems that when the batch job finished, the task manager was recycled, and the operator treated this state as an error and tried to recover from it. So I tried session mode. – gix Nov 24 '22 at 06:44

test environment:

  • flink-kubernetes-operator v1.1
  • flink_1_14

operations:

  • deploy a session cluster
  • deploy a session job (a batch job)
  • restart the session job by modifying restartNonce

results:

  • the first time, the session job starts successfully
  • after the restartNonce change is applied, the session job cannot start and fails with the error 'Exception occurred in REST handler: Job could not be found'

The restartNonce change was applied at 2022-11-24 06:20:29,079.

logs:

2022-11-24 06:14:30,314 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Streaming WordCount (24d8e9726de88ab201ea13d48e9cdc8e) switched from state RUNNING to FINISHED.
2022-11-24 06:14:30,314 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping checkpoint coordinator for job 24d8e9726de88ab201ea13d48e9cdc8e.
2022-11-24 06:14:30,315 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 24d8e9726de88ab201ea13d48e9cdc8e reached terminal state FINISHED.
2022-11-24 06:14:30,317 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Stopping the JobMaster for job 'Streaming WordCount' (24d8e9726de88ab201ea13d48e9cdc8e).
2022-11-24 06:14:30,317 INFO  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
2022-11-24 06:14:30,317 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [5a259aa9f56d090c4c4df02ca2e4f189].
2022-11-24 06:14:30,318 INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [7eb2fecceb9aff71e2daa4d358c8031a].
2022-11-24 06:14:30,318 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Close ResourceManager connection abe9ce776ee288f79d2e0a1921fb0896: Stopping JobMaster for job 'Streaming WordCount' (24d8e9726de88ab201ea13d48e9cdc8e).
2022-11-24 06:14:30,318 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@gix-flink-cluster.flink-examples:6123/user/rpc/jobmanager_4 for job 24d8e9726de88ab201ea13d48e9cdc8e from the resource manager.
2022-11-24 06:15:26,189 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Stopping worker gix-flink-cluster-taskmanager-1-3.
2022-11-24 06:15:26,189 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Stopping TaskManager pod gix-flink-cluster-taskmanager-1-3.
2022-11-24 06:15:26,189 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Closing TaskExecutor connection gix-flink-cluster-taskmanager-1-3 because: TaskExecutor exceeded the idle timeout.
2022-11-24 06:15:26,204 WARN  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Discard registration from TaskExecutor gix-flink-cluster-taskmanager-1-3 at (akka.tcp://flink@10.238.15.21:6122/user/rpc/taskmanager_0) because the framework did not recognize it
2022-11-24 06:15:26,626 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@10.238.15.21:6122] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2022-11-24 06:15:26,626 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@10.238.15.21:46779] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2022-11-24 06:20:29,079 ERROR org.apache.flink.runtime.rest.handler.job.JobCancellationHandler [] - Exception occurred in REST handler: Job could not be found.
2022-11-24 06:20:31,111 ERROR org.apache.flink.runtime.rest.handler.job.JobCancellationHandler [] - Exception occurred in REST handler: Job could not be found.
2022-11-24 06:20:33,122 ERROR org.apache.flink.runtime.rest.handler.job.JobCancellationHandler [] - Exception occurred in REST handler: Job could not be found.
2022-11-24 06:20:36,152 ERROR org.apache.flink.runtime.rest.handler.job.JobCancellationHandler [] - Exception occurred in REST handler: Job could not be found.
2022-11-24 06:20:40,663 ERROR org.apache.flink.runtime.rest.handler.job.JobCancellationHandler [] - Exception occurred in REST handler: Job could not be found.
2022-11-24 06:20:47,427 ERROR org.apache.flink.runtime.rest.handler.job.JobCancellationHandler [] - Exception occurred in REST handler: Job could not be found.
gix
  • What I suggested will only work on Flink 1.15 and up. Flink 1.14 does not yet have the feature required to keep the JobManager around after the job finishes in application mode. Because of this, the operator never actually sees that the job finished, only that the cluster disappeared, which could indicate either a finished or a failed job. – Gyula Nov 24 '22 at 10:04