Questions tagged [checkpointing]
105 questions
3
votes
1 answer
pytorch torch.load load_checkpoint and learning_rate
Following this Medium post, I understand how to save and load my model (or at least I think I do). They say the learning_rate is saved. However, looking at this person's code (it's a GitHub repo with lots of people watching, forking, etc. so I'm…

FluidMechanics Potential Flows
- 594
- 10
- 23
3
votes
1 answer
Azure Event Hubs Streaming: Does Checkpointing override setStartingPosition?
If we specify the starting position in EventHub conf like so:
EventHubsConf(ConnectionStringBuilder(eventHubConnectionString).build)
.setStartingPosition(EventPosition.fromStartOfStream)
or
…

Gadam
- 2,674
- 8
- 37
- 56
3
votes
2 answers
TF Keras ModelCheckpoint filepath batch number
I am using ModelCheckpoint to save checkpoints every 500 batches in every epoch. It is documented here https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint.
How would I set filepath to include the batch number? I know I can…

rishai
- 495
- 3
- 15
3
votes
0 answers
Spark Structured Streaming: Error reading delta file with hdfs checkpoint location
I want to run a Spark Structured Streaming job locally on a single machine. Unfortunately, recovering from an aborted job does not work when the job was aborted while processing data (it fails with the log shown below).
(If the streaming job is…

Esparko
- 31
- 1
3
votes
0 answers
Is there a way to export/checkpoint OpenCV Background Subtraction for later use?
I have some very long video files to process which require background removal. I would like to cut the video into small chunks and process each chunk separately. …

WesH
- 460
- 5
- 15
3
votes
1 answer
MapWithState gives java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast while recovering from checkpoint
I am facing an issue with a Spark Streaming job where I am trying to use broadcast, mapWithState, and checkpointing together.
Following is the usage:
Since I have to pass some connection object (which is not Serializable) to the executors, I…

Saman
- 53
- 6
3
votes
2 answers
Is checkpointing necessary in spark streaming
I have noticed that Spark Streaming examples also include code for checkpointing. My question is: how important is checkpointing? If it's there for fault tolerance, how often do faults happen in such streaming applications?

pythonic
- 20,589
- 43
- 136
- 219
3
votes
2 answers
h2o deeplearning checkpoint
I'm trying to run h2o.deeplearning twice, using the checkpoint parameter, on two training sets (with the same parameters except for the number of epochs).
I'm getting the following error:
Error: 'The columns of the training data must be the same as for the checkpointed…

eli
- 81
- 5
3
votes
1 answer
Apache Spark - accessing internal data on RDDs?
I started doing the amp-camp 5 exercises. I tried the following 2 scenarios:
Scenario #1
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count
Scenario #2
val pagecounts =…

Jatin Ganhotra
- 6,825
- 6
- 48
- 71
2
votes
2 answers
Transparently replace file mapping with anonymous
I am doing a checkpoint and restore using CRIU; after restore, my application wakes with some threads that have their stacks mmapped into files on disk (CRIU doesn't do this by default; it is a custom optimization). Later on, I want to…

Radim Vansa
- 5,686
- 2
- 25
- 40
2
votes
1 answer
How to restore a specific checkpoint in tensorflow2 (to implement early stopping)?
I used the following code to create a checkpoint manager outside the loop in which I train my model:
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(object_1=object_1)
ckpt_manager = tf.train.CheckpointManager(ckpt,…

khemedi
- 774
- 3
- 9
- 19
2
votes
0 answers
snakemake checkpoint calling variable not defined
I have the Snakefile below, which uses checkpoints. I am trying to run it for 2 samples (defined as RUNS). However, every time I try, an additional variable gets included. Any thoughts on how to resolve this? Thank you.
import os
from tempfile…

Susheel Busi
- 163
- 8
2
votes
0 answers
Apache beam job on Flink checkpoint size growing over time
One of our Apache Beam jobs running through the FlinkRunner is experiencing odd behavior with checkpoint size. The state backend is file-based. The job receives traffic once a day for a period of an hour and then is idle until it receives more…

TheFlyingFox
- 31
- 4
2
votes
1 answer
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:51
When I try to load a pytorch checkpoint:
checkpoint = torch.load(pathname)
I see:
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:51
I created the checkpoint with…

Tom Hale
- 40,825
- 36
- 187
- 242
2
votes
1 answer
How to set the setCheckpoint in pyspark
I don't know much about Spark. At the top of the code I have
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.appName('abc').getOrCreate()
H = sqlContext.read.parquet('path to hdfs file')
H has about 30…

pmjn6
- 307
- 1
- 4
- 14