Questions tagged [checkpointing]
105 questions
0 votes, 2 answers
How can I use tensorflow.train.Checkpoint to recover a previously trained net?
I'm trying to understand how to recover a saved/checkpointed net using tensorflow.train.Checkpoint.restore.
I'm using code that's strongly based on Google's Colab tutorial for creating a pix2pix GAN. Below, I've excerpted the key portion, which just…

user1245262
- 6,968
- 8
- 50
- 77
0 votes, 2 answers
Correctly sending Flink state to Kafka
I'm building a Kafka -> Flink -> Kafka pipeline that works with delineated "session" data. My input Kafka topic has data in the following format and constitutes one session for session_key:
start_event(session_key,…

kozyr
- 1,334
- 1
- 20
- 30
0 votes, 1 answer
Checkpointing with LD_PRELOAD -- how to manipulate the instruction pointer and call stack?
The LD_PRELOAD technique allows us to supply our own custom standard library functions to an existing binary, overriding the standard ones or manipulating their behaviour, giving a fun way to experiment with a binary and understand its…

atomickitten
- 213
- 2
- 11
0 votes, 1 answer
Apache Flink Checkpointing (Manually put a value into RocksDB Checkpoint and retrieve during recovery or Restart)
We have a scenario where we have to persist/save some value into the checkpoint and retrieve it back during failure recovery/application restart.
We followed a few things like ValueState, ValueStateDescriptor still not…

Ajit Raju
- 21
- 5
0 votes, 3 answers
Bash script checkpoints
I am developing a large script whose skeleton looks like this:
#!/bin/bash
load_variables()
function_1()
function_2()
function_3()
[...]
function_n()
On each launch, user flags are first loaded in the load_variables() function.
Then the script…

Karol Mazurek
- 37
- 8
0 votes, 1 answer
How to get layer execution time on an AI model saved as .pth file?
I'm trying to run a Resnet-like image classification model on a CPU, and want to know the breakdown of time it takes to run each layer of the model.
The issue I'm facing is the GitHub link…

Joe Black
- 625
- 6
- 19
0 votes, 0 answers
Opening RocksDB in Java with existing checkpoint files
I have a streaming pipeline that uses rocksdbjni 6.15.2 to manage and checkpoint state. I'm trying to use this same library in a separate offline Scala process to read the checkpoint files, and do some further processing.
To test, I copied one of…
0 votes, 1 answer
Update EventHub Partition Offset Checkpoint on Azure.Messaging.EventHubs.EventProcessorClient When Idle
In my scenario I will have batches of events coming in all at once and then long periods of time when the EventHub will be idle. In my processor client I want to checkpoint every N events or N minutes (whichever comes first).
Here is how I've set up…
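The "every N events or N minutes, whichever comes first" rule described above can be sketched independently of any SDK. The following is a minimal illustration in Python, not the Azure client API: the class name `CheckpointPolicy` and all of its methods are hypothetical, and the time limit is assumed to be measured from the oldest event that has not yet been checkpointed, so an idle hub with nothing pending never triggers a checkpoint.

```python
import time
from typing import Callable


class CheckpointPolicy:
    """Hypothetical helper: a checkpoint is due after max_events pending
    events, or once the oldest pending event is max_seconds old."""

    def __init__(self, max_events: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_events = max_events
        self.max_seconds = max_seconds
        self._clock = clock
        self._pending = 0              # events seen since the last checkpoint
        self._oldest_pending_at = 0.0  # arrival time of the first pending event

    def record_event(self) -> bool:
        """Register one processed event; return True if a checkpoint is due."""
        if self._pending == 0:
            self._oldest_pending_at = self._clock()
        self._pending += 1
        return self.is_due()

    def is_due(self) -> bool:
        """Can also be polled from a timer while no events are arriving."""
        if self._pending == 0:
            return False  # nothing new to checkpoint
        age = self._clock() - self._oldest_pending_at
        return self._pending >= self.max_events or age >= self.max_seconds

    def mark_checkpointed(self) -> None:
        """Call after the checkpoint has actually been written."""
        self._pending = 0
```

In this sketch the event handler would call `record_event()` for each event, while a periodic timer calls `is_due()` so that a partially filled batch is still checkpointed during an idle period.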

INNVTV
- 3,155
- 7
- 37
- 71
0 votes, 2 answers
Stream Processing: How often should a checkpoint be initiated?
I am setting up an analytics pipeline using Apache Flink to process a stream of IoT data. While attempting to configure the system, I cannot find any sources on how often checkpointing should be initiated. Are there any recommendations or…

Hegemon
- 77
- 10
0 votes, 0 answers
Spark Structured Streaming and DStream applications are writing duplicates
We are trying to write a Spark Streaming application that writes to HDFS. However, whenever we write the files, lots of duplicates show up. This happens whether or not we crash the application using kill, and also for both DStream…

GG GVG
- 1
- 1
0 votes, 1 answer
Spark Scala Checkpointing Data Set showing .isCheckpointed = false after Action but directories written
There seem to be a few postings on this, but none seem to answer my question.
The following code was run on Databricks:
spark.sparkContext.setCheckpointDir("/dbfs/FileStore/checkpoint/cp1/loc7")
val checkpointDir =…

thebluephantom
- 16,458
- 8
- 40
- 83
0 votes, 1 answer
How to persist a Queryable State in Flink?
I am using Flink v1.4.0. I am using a QueryableStateStream which I key in some way and then sink to create a Queryable State, e.g.:
stream.keyBy(0).asQueryableState("query-name");
That's all good as long as my Flink job is running. As soon as…

Christos Hadjinikolis
- 2,099
- 3
- 20
- 46
0 votes, 1 answer
Validation Split and Checkpoint Best Model in Keras
Let us use a validation split of 0.3 when fitting a Sequential model. What will be used for validation, the first or the last 30% of the samples?
Secondly, checkpointing the best model saves the best model weights in .hdf5 file format. Does this mean that,…
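On the first question, the Keras documentation for `fit` states that the validation data is selected from the last samples of the provided data, before shuffling. A minimal plain-Python sketch of that slicing follows; the helper name `split_like_validation_split` is hypothetical and this is an illustration of the documented behaviour, not Keras's own implementation.

```python
import math


def split_like_validation_split(samples, validation_split):
    """Hypothetical helper: keep the leading samples for training and
    hold out the trailing fraction for validation, without shuffling."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]


samples = list(range(10))  # stand-in for 10 training samples, in input order
train, val = split_like_validation_split(samples, 0.3)
print("train:", train)
print("validation:", val)
```

With ten samples and a split of 0.3, the last three samples in input order end up in the validation set, so data that is sorted by class should be shuffled before calling `fit`.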

Khan
- 81
- 2
- 7
0 votes, 1 answer
Spark on EMR "exceeding memory limits" for checkpointed/cached job
Is my understanding of caching wrong? The resulting RDD after all my transformations is incredibly small, like 1 GB. The data it was computed from is quite large, ~700 GB in size.
I have to run logic to read in thousands of pretty big files, all to…

pigate
- 351
- 1
- 7
- 16
0 votes, 0 answers
How to save large datasets to a checkpoint-file in MATLAB when training a neural network?
When training a neural network with large datasets (and few features) in MATLAB, the tr structure grows above 2 GB, making the automatic checkpoint-saving feature unusable. MATLAB throws the following error:
Warning: Variable 'checkpoint' was not…

fixingstuff
- 559
- 2
- 7
- 18