Questions tagged [checkpointing]

105 questions
0
votes
2 answers

How can I use Tensorflow.Checkpoint to recover a previously trained net

I'm trying to understand how to recover a saved/checkpointed net using tensorflow.train.Checkpoint.restore. I'm using code that's strongly based on Google's Colab tutorial for creating a pix2pix GAN. Below, I've excerpted the key portion, which just…
user1245262
  • 6,968
  • 8
  • 50
  • 77
0
votes
2 answers

Correctly sending Flink state to Kafka

I'm building a Kafka -> Flink -> Kafka pipeline that works with delineated "session" data. My input Kafka topic has data in the following format and constitutes one session for session_key: start_event(session_key,…
kozyr
  • 1,334
  • 1
  • 20
  • 30
0
votes
1 answer

Checkpointing with LD_PRELOAD -- how to manipulate the instruction pointer and call stack?

The LD_PRELOAD technique allows us to supply our own custom standard library functions to an existing binary, overriding the standard ones or manipulating their behaviour, giving a fun way to experiment with a binary and understand its…
atomickitten
  • 213
  • 2
  • 11
0
votes
1 answer

Apache Flink Checkpointing (Manually put a value into RocksDB Checkpoint and retrieve during recovery or Restart)

We have a scenario where we have to persist/save some value into the checkpoint and retrieve it during failure recovery/application restart. We tried a few things, such as ValueState and ValueStateDescriptor, but still not…
Ajit Raju
  • 21
  • 5
0
votes
3 answers

Bash script checkpoints

I am developing a big script whose skeleton looks like this: #!/bin/bash load_variables() function_1() function_2() function_3() [...] function_n() On each launch, user flags are first loaded in the load_variables() function. Then the script…
0
votes
1 answer

How to get layer execution time on an AI model saved as .pth file?

I'm trying to run a ResNet-like image classification model on a CPU, and want to know the breakdown of the time it takes to run each layer of the model. The issue I'm facing is the GitHub link…
Joe Black
  • 625
  • 6
  • 19
0
votes
0 answers

Opening RocksDB in Java with existing checkpoint files

I have a streaming pipeline that uses rocksdbjni 6.15.2 to manage and checkpoint state. I'm trying to use this same library in a separate offline Scala process to read the checkpoint files, and do some further processing. To test, I copied one of…
0
votes
1 answer

Update EventHub Partition Offset Checkpoint on Azure.Messaging.EventHubs.EventProcessorClient When Idle

In my scenario I will have batches of events coming in all at once and then long periods of time when the EventHub will be idle. In my processor client I want to checkpoint every N events or N minutes (whichever comes first). Here is how I've set up…
INNVTV
  • 3,155
  • 7
  • 37
  • 71
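
The "every N events or N minutes, whichever comes first" rule described in this question can be sketched independently of any SDK. The snippet below is a minimal illustration in Python, not the Azure.Messaging.EventHubs API: the class name `CheckpointPolicy` and its methods are hypothetical, and it assumes the caller polls `should_checkpoint()` from a timer as well as from the event handler, since no events arrive to trigger the check while the hub is idle.

```python
import time
from typing import Callable


class CheckpointPolicy:
    """Hypothetical helper: checkpoint after max_events events or
    max_seconds since the last checkpoint, whichever comes first."""

    def __init__(self, max_events: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_events = max_events
        self.max_seconds = max_seconds
        self._clock = clock              # injectable so the policy can be tested
        self._pending = 0                # events processed since last checkpoint
        self._last_checkpoint = clock()

    def record_event(self) -> None:
        """Call once for every successfully processed event."""
        self._pending += 1

    def should_checkpoint(self) -> bool:
        """True when there is uncheckpointed work and a threshold is reached."""
        if self._pending == 0:
            return False                 # nothing new to persist
        elapsed = self._clock() - self._last_checkpoint
        return self._pending >= self.max_events or elapsed >= self.max_seconds

    def mark_checkpointed(self) -> None:
        """Call after the checkpoint has actually been written."""
        self._pending = 0
        self._last_checkpoint = self._clock()
```

In this sketch the event handler would call `record_event()` and then check `should_checkpoint()`, while a periodic timer performs the same check so that a trailing batch is still checkpointed once the time limit passes during an idle period.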
0
votes
2 answers

Stream Processing: How often should a checkpoint be initiated?

I am setting up an analytics pipeline using Apache Flink to process a stream of IoT data. While attempting to configure the system, I cannot seem to find any sources on how often checkpointing should be initiated. Are there any recommendations or…
0
votes
0 answers

Spark structured and Dstream application is writing duplicates

We are trying to write a Spark streaming application that writes to HDFS. However, whenever we write the files, lots of duplicates show up. This happens whether or not we crash the application using kill. And also for both Dstream…
GG GVG
  • 1
  • 1
0
votes
1 answer

Spark Scala Checkpointing Data Set showing .isCheckpointed = false after Action but directories written

There seem to be a few postings on this, but none seem to answer what I understand. The following code was run on Databricks: spark.sparkContext.setCheckpointDir("/dbfs/FileStore/checkpoint/cp1/loc7") val checkpointDir =…
thebluephantom
  • 16,458
  • 8
  • 40
  • 83
0
votes
1 answer

How to persist a Queryable State in Flink?

I am using Flink v1.4.0. I am using a QueryableStateStream which I key in some way and then sink it to create a Queryable State, e.g.: stream.keyBy(0).asQueryableState("query-name"); That's all good as long as my Flink job is running. As soon as…
0
votes
1 answer

Validation Split and Checkpoint Best Model in Keras

Let us use a validation split of 0.3 when fitting a Sequential model. What will be used for validation, the first or the last 30% of the samples? Secondly, checkpointing the best model saves the best model weights in .hdf5 file format. Does this mean that,…
Khan
  • 81
  • 2
  • 7
0
votes
1 answer

Spark on EMR "exceeding memory limits" for checkpointed/cached job

Is my understanding of caching wrong? The resulting RDD after all my transformations is incredibly small, like 1 GB. The data it was computed from is quite large, ~700 GB in size. I have to run logic to read in thousands of pretty big files, all to…
0
votes
0 answers

How to save large datasets to a checkpoint-file in MATLAB when training a neural network?

When training a neural network with large datasets (and few features) in MATLAB, the tr structure will grow above 2 GB making the automatic checkpoint-saving feature unusable. MATLAB throws the following error: Warning: Variable 'checkpoint' was not…
fixingstuff
  • 559
  • 2
  • 7
  • 18