Highest Voted 'checkpointing' Questions

1 vote, 2 answers

Is there any way to ensure all CheckpointListeners are notified about checkpoint completion in Flink when a job is cancelled with a savepoint?

I'm using Flink 1.9 and the REST API endpoint /jobs/:jobid/savepoints to trigger a savepoint and cancel the job (stopping the job gracefully so it can later be resumed from the savepoint). I use a two-phase commit in the source function, so my source implements both…

asked Aug 07 '20 at 09:20 by Mikalai Lushchytski (reputation 1,563; badges 1, 9, 18)

1 vote, 2 answers

Too many ongoing snapshots. Increase kafka producers pool size or decrease number of concurrent checkpoints

I am working on a Flink application that sinks to Kafka. I created a Kafka producer with the default pool size of 5. I have enabled checkpoints with the following config: env.enableCheckpointing(1800000); // checkpoint every 30 minutes //…

java kubernetes apache-kafka apache-flink checkpointing

asked Mar 20 '20 at 22:39 by VSK (reputation 359; badges 2, 5, 20)

1 vote, 1 answer

Since it is not "checkpoint", what is the standard method for crash-recovery to resume TensorFlow 2.0 Training?

To resume training after a crash, one must restore not only the model but all objects and parameters that go into the state of a model.fit(...) process. Before I bother forking the Keras code to implement a fitting object that includes, for example,…

tensorflow2.0 checkpointing

asked Dec 28 '19 at 22:10 by user3673 (reputation 665; badges 5, 21)

1 vote, 0 answers

How to take statistics of different checkpoints at different segments in gem5?

I have created some (say 10) checkpoints at a fixed interval in the ROI of a gem5 simulation of the PARSEC benchmark. Then I tried restoring the checkpoints with the following command: ./build/ALPHA/gem5.opt configs/example/fs.py -r 1 but I got…

gem5 checkpointing

asked Jun 18 '19 at 05:21 by Saurav Malla (reputation 11; badges 1)

1 vote, 0 answers

Cannot complete snapshot errors in Apache Flink

I have a recurring problem after deployment that I cannot reproduce locally. I would appreciate your help. See the logs: [realtime-event-processor-flink-job-cluster-cc4d4b46c-8cghx]…

apache-flink snapshot checkpointing

asked Apr 21 '19 at 08:28 by Coder2114 (reputation 19; badges 5)

1 vote, 2 answers

Apache Flink: Job recovery in IDE execution not working as expected

I have a sample streaming WordCount example written in Flink (Scala). In it, I want to use externalized checkpointing to restore in case of failure. But it is not working as expected. My code is as follows: object WordCount { def main(args:…

apache-flink flink-streaming checkpointing

asked Apr 14 '19 at 20:06 by himanshuIIITian (reputation 5,985; badges 6, 50, 70)

1 vote, 1 answer

CRIU usage for Java application

I want to use CRIU to take a snapshot of a JVM process and restore it later. For this purpose I wrote a little program that does nothing more than print a counter every second: package some; public class Fun { public static void…

jvm migration restore checkpointing

asked Apr 08 '19 at 12:58 by Aksim Elnik (reputation 425; badges 6, 27)

1 vote, 1 answer

Save and load checkpoint pytorch

I made a model and save its configuration as follows: def checkpoint(state, ep, filename='./Risultati/checkpoint.pth'): if ep == (n_epoch-1): print('Saving state...') …

python-3.x pytorch recurrent-neural-network checkpointing

asked Nov 29 '18 at 11:35 by Marco Vicedomini (reputation 13; badges 3)

1 vote, 1 answer

ImageProjectiveTransformV2 error in loading meta graph by import_meta_graph

I am trying to load the meta graph of a trained network, "name.ckpt-1.meta", using tf.train.import_meta_graph("./name.ckpt-1.meta"), but the following error appears: Traceback (most recent call last): File…

tensorflow checkpointing

asked Nov 21 '18 at 09:57 by Ali (reputation 387; badges 4, 11)

1 vote, 0 answers

Spark structured streaming change of Kafka brokers - effect on checkpoint

We have a Spark Structured Streaming application running in production with an in-house managed Kafka (let's call it kafka-inhouse). We are planning to migrate to Aiven Kafka cloud. Assuming: We consume all messages from kafka-inhouse, and then the new…

apache-spark apache-kafka spark-structured-streaming checkpointing

asked Oct 10 '18 at 06:49 by Chandan Bhattad (reputation 351; badges 1, 5, 21)

1 vote, 2 answers

Stop and Restart Training on VGG-16

I am using a pre-trained VGG-16 model for image classification. I am adding a custom last layer because I have 10 classes. I am training the model for 200 epochs. My question is: is there any way, if I randomly stop (by closing…

python-3.x machine-learning keras checkpointing vgg-net

asked Aug 24 '18 at 16:49 by Pervaiz Niazi (reputation 199; badges 2, 14)

1 vote, 1 answer

Does tf.train.CheckpointSaverHook in tf.train.MonitoredTrainingSession block training while checkpointing, or is it done asynchronously?

I am pretty new to TensorFlow. I would like to track the I/O time and bandwidth of checkpointing (preferably the percentage of the training process spent on checkpoint I/O) as performed by the internal checkpointing mechanism…

tensorflow io profiling checkpointing

asked Jul 07 '18 at 00:08 by Fahim (reputation 11; badges 2)

1 vote, 1 answer

Flink exactly once - checkpoint and barrier acknowledgement at sink

I have a Flink job with a sink that writes data into MongoDB. The sink is an implementation of RichSinkFunction. Externalized checkpointing is enabled. The interval is 5000 ms and the scheme is EXACTLY_ONCE. Flink version 1.3, Kafka (source…

apache-flink flink-streaming checkpointing

asked May 31 '18 at 01:53 by Mudit bhaintwal (reputation 528; badges 1, 7, 21)

1 vote, 1 answer

What would happen if I configured a local file system for Flink checkpointing?

I watched a video named "Managing State in Apache Flink" by Tzu-Li (Gordon) Tai. In the video, data is stored in a distributed file system. I'm wondering what would happen if I configured a local file system for Flink checkpointing, e.g.:…

apache-flink flink-streaming checkpointing

asked Apr 12 '18 at 07:13 by Brutal_JL (reputation 2,839; badges 2, 21, 27)

1 vote, 1 answer

Docker Checkpoint/Restore with CRIU - Kernel doesn't support PTRACE_O_SUSPEND_SECCOMP

I am attempting a hello-world example of Docker checkpoint/restore with CRIU (https://criu.org/Docker). Here is the output from criu check --all: Error (criu/cr-check.c:648): Kernel doesn't support PTRACE_O_SUSPEND_SECCOMP Error…

docker centos7 rhel7 checkpointing

asked Mar 14 '18 at 15:48 by DGardner42 (reputation 31; badges 5)
