Questions tagged [checkpointing]

105 questions
1
vote
2 answers

Is there any way to ensure all CheckpointListeners are notified about checkpoint completion in Flink when a job is cancelled with a savepoint?

I'm using Flink 1.9 and the REST API /jobs/:jobid/savepoints to trigger a savepoint and cancel the job (stopping it gracefully so it can be run later from the savepoint). I use a two-phase commit in the source function, so my source implements both…
Mikalai Lushchytski
1
vote
2 answers

Too many ongoing snapshots. Increase kafka producers pool size or decrease number of concurrent checkpoints

I am working on a Flink application that sinks to Kafka. I created a Kafka producer that has the default pool size of 5. I have enabled checkpoints with the following config: env.enableCheckpointing(1800000); // checkpointing every 30 minutes //…
VSK
1
vote
1 answer

Since it is not "checkpoint", what is the standard method for crash-recovery to resume TensorFlow 2.0 Training?

To resume training after a crash, one must restore not only the model but all objects and parameters that go into the state of a model.fit(...) process. Before I bother forking the Keras code to implement a fitting object that includes, for example,…
user3673
1
vote
0 answers

How to collect statistics for different checkpoints at different segments in gem5?

I have created some (say 10) checkpoints at a fixed interval in the ROI of a gem5 simulation for the PARSEC benchmark. Then I tried restoring the checkpoints with the following command: ./build/ALPHA/gem5.opt configs/example/fs.py -r 1 but I got…
1
vote
0 answers

Cannot complete snapshot errors in Apache Flink

I have a recurring problem after deployment that I cannot reproduce locally. I would appreciate your help. See logs: [realtime-event-processor-flink-job-cluster-cc4d4b46c-8cghx]…
Coder2114
1
vote
2 answers

Apache Flink: Job recovery in IDE execution not working as expected

I have a sample streaming WordCount example written in Flink (Scala). In it, I want to use externalized checkpointing to restore in case of failure. But it is not working as expected. My code is as follows: object WordCount { def main(args:…
himanshuIIITian
1
vote
1 answer

CRIU usage for Java application

I want to use CRIU to make a snapshot of a JVM process and restore it later. For this purpose I wrote a little program that does nothing more than print a counter every second: package some; public class Fun { public static void…
Aksim Elnik
1
vote
1 answer

Save and load checkpoint pytorch

I made a model and save the configuration as: def checkpoint(state, ep, filename='./Risultati/checkpoint.pth'): if ep == (n_epoch-1): print('Saving state...') …
1
vote
1 answer

ImageProjectiveTransformV2 error in loading meta graph by import_meta_graph

I am trying to load the meta graph of a trained network, "name.ckpt-1.meta", using tf.train.import_meta_graph("./name.ckpt-1.meta"), but the following error appears: Traceback (most recent call last): File…
Ali
1
vote
0 answers

Spark structured streaming change of Kafka brokers - effect on checkpoint

We have a Spark Structured Streaming application running in production with in-house managed Kafka (let's call it kafka-inhouse). We are planning to migrate to Aiven's Kafka cloud. Assuming: We consume all messages from kafka-inhouse, and then the new…
1
vote
2 answers

Stop and Restart Training on VGG-16

I am using a pre-trained VGG-16 model for image classification. I am adding a custom last layer because I have 10 classes. I am training the model for 200 epochs. My question is: is there any way, if I randomly stop (by closing…
1
vote
1 answer

Does tf.train.CheckpointSaverHook in tf.train.MonitoredTrainingSession block training while checkpointing, or is it done asynchronously?

I am pretty new to TensorFlow. I would like to track the I/O time and bandwidth (preferably the percentage of training time spent on I/O for checkpointing) for checkpointing which is performed by the internal checkpointing mechanism…
Fahim
1
vote
1 answer

Flink exactly once - checkpoint and barrier acknowledgement at sink

I have a Flink job with a sink that writes data into MongoDB. The sink is an implementation of RichSinkFunction. Externalized checkpointing is enabled. The interval is 5000 ms and the mode is EXACTLY_ONCE. Flink version 1.3, Kafka (source…
Mudit bhaintwal
1
vote
1 answer

What would happen if I configured a local file system for Flink checkpointing?

I watched a video named Managing State in Apache Flink - Tzu-Li (Gordon) Tai. In the video, data is stored in a distributed file system. I'm wondering what would happen if I configured a local file system for Flink checkpointing, e.g.:…
Brutal_JL
1
vote
1 answer

Docker Checkpoint/Restore with CRIU - Kernel doesn't support PTRACE_O_SUSPEND_SECCOMP

I am attempting a hello-world example of Docker checkpoint/restore with CRIU (https://criu.org/Docker). Here is the output from criu check --all: Error (criu/cr-check.c:648): Kernel doesn't support PTRACE_O_SUSPEND_SECCOMP Error…
DGardner42