Questions tagged [checkpointing]
105 questions
1 vote, 2 answers
Is there any way to ensure all CheckpointListeners are notified about checkpoint completion in Flink on job cancel with savepoint?
I'm using Flink 1.9 and the REST API /jobs/:jobid/savepoints to trigger the savepoint and cancel the job (stop the job gracefully so it can be resumed later from the savepoint).
I use a two-phase commit in the source function, so my source implements both…

Mikalai Lushchytski
1 vote, 2 answers
Too many ongoing snapshots. Increase kafka producers pool size or decrease number of concurrent checkpoints
I am working on a Flink application that sinks to Kafka. I created a Kafka producer that has a default pool size of 5. I have enabled checkpoints with the following config:
env.enableCheckpointing(1800000); // checkpoint every 30 minutes
//…

VSK
1 vote, 1 answer
Since it is not "checkpoint", what is the standard method for crash-recovery to resume TensorFlow 2.0 Training?
To resume training after a crash, one must restore not only the model but all objects and parameters that go into the state of a model.fit(...) process.
Before I bother forking the Keras code to implement a fitting object that includes, for example,…

user3673
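The point made in the excerpt above, that crash recovery needs the whole training state and not only the model, can be sketched without any framework. The following is a minimal illustration in plain Python, not TensorFlow's or Keras's mechanism: `save_checkpoint`, `load_checkpoint`, `train`, the `stop_after` parameter, and the toy `weights` value are all hypothetical names introduced for this sketch. It assumes the state fits in one picklable dictionary holding the epoch counter, the parameters, and the random-number-generator state.

```python
import os
import pickle
import random
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target.

    The rename means a crash mid-write never leaves a half-written
    checkpoint at `path`.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_epochs, stop_after=None):
    """Toy training loop: 'weights' is a running sum of random draws."""
    state = load_checkpoint(path) or {
        "epoch": 0,
        "weights": 0.0,
        "rng_state": random.Random(0).getstate(),
    }
    rng = random.Random()
    rng.setstate(state["rng_state"])  # resume the exact random sequence
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulate a crash before this epoch runs
        state["weights"] += rng.random()
        state["epoch"] = epoch + 1
        state["rng_state"] = rng.getstate()
        save_checkpoint(path, state)
    return state
```

Calling `train(path, 5, stop_after=2)` and then `train(path, 5)` should end with the same `weights` as one uninterrupted `train(other_path, 5)`, because the generator state is restored along with the epoch counter. In a real framework the same dictionary would also need the optimizer state and any learning-rate schedule position.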
1 vote, 0 answers
How to collect statistics for different checkpoints at different segments in gem5?
I have created some (say 10) checkpoints at a fixed interval in the ROI of a gem5 simulation of the PARSEC benchmark.
Then I tried restoring the checkpoints with the following command:
./build/ALPHA/gem5.opt configs/example/fs.py -r 1
but I got…

Saurav Malla
1 vote, 0 answers
Cannot complete snapshot errors in Apache Flink
I have a recurring problem after deployment that I cannot reproduce locally. I would appreciate your help. See the logs:
[realtime-event-processor-flink-job-cluster-cc4d4b46c-8cghx]…

Coder2114
1 vote, 2 answers
Apache Flink: Job recovery in IDE execution not working as expected
I have a sample streaming WordCount example written in Flink (Scala). In it, I want to use externalized checkpointing to restore in case of failure. But it is not working as expected.
My code is as follows:
object WordCount {
def main(args:…

himanshuIIITian
1 vote, 1 answer
CRIU usage for Java application
I want to use CRIU to take a snapshot of a JVM process and restore it later. For this purpose I wrote a small program that does nothing more than print a counter every second:
package some;
public class Fun {
public static void…

Aksim Elnik
1 vote, 1 answer
Save and load checkpoint pytorch
I built a model and save the configuration as:
def checkpoint(state, ep, filename='./Risultati/checkpoint.pth'):
    if ep == (n_epoch-1):
        print('Saving state...')
…

Marco Vicedomini
1 vote, 1 answer
ImageProjectiveTransformV2 error in loading meta graph by import_meta_graph
I am trying to load the meta graph of a trained network, "name.ckpt-1.meta", using tf.train.import_meta_graph("./name.ckpt-1.meta")
but the following error appears:
Traceback (most recent call last):
File…

Ali
1 vote, 0 answers
Spark structured streaming change of Kafka brokers - effect on checkpoint
We have a Spark Structured Streaming application running in production with in-house managed Kafka (let's call it kafka-inhouse).
We have decided to migrate to Aiven Kafka cloud.
Assuming:
We consume all messages from kafka-inhouse, and then the new…

Chandan Bhattad
1 vote, 2 answers
Stop and Restart Training on VGG-16
I am using a pre-trained VGG-16 model for image classification. I am adding a custom last layer because my task has 10 classes. I am training the model for 200 epochs.
My question is: is there any way if I randomly stop (by closing…

Pervaiz Niazi
1 vote, 1 answer
Does tf.train.CheckpointSaverHook in tf.train.MonitoredTrainingSession block training while checkpointing, or is it done asynchronously?
I am pretty new to TensorFlow. I want to track the I/O time and bandwidth (preferably the percentage of training time spent on I/O for checkpointing) of the checkpointing that is performed by the internal checkpointing mechanism…

Fahim
1 vote, 1 answer
Flink exactly once - checkpoint and barrier acknowledgement at sink
I have a Flink job with a sink that is writing the data into MongoDB. The sink is an implementation of RichSinkFunction.
Externalized checkpointing is enabled. The interval is 5000 ms and the mode is EXACTLY_ONCE.
Flink version 1.3,
Kafka (source…

Mudit bhaintwal
1 vote, 1 answer
What would happen if I configured a local file system for Flink checkpointing?
I watched a video named Managing State in Apache Flink - Tzu-Li (Gordon) Tai.
In the video, data is stored on a distributed file system.
I'm wondering what would happen if I configured a local file system for Flink checkpointing.
e.g.:…

Brutal_JL
1 vote, 1 answer
Docker Checkpoint/Restore with CRIU - Kernel doesn't support PTRACE_O_SUSPEND_SECCOMP
I am attempting a hello-world example of Docker checkpoint/restore with CRIU (https://criu.org/Docker).
Here is the output from criu check --all:
Error (criu/cr-check.c:648): Kernel doesn't support
PTRACE_O_SUSPEND_SECCOMP
Error…

DGardner42