Questions tagged [checkpointing]
105 questions
1
vote
2 answers
Spark session Null Pointer with Checkpointing
I have enabled checkpointing, which saves the logs to S3.
If there are NO files in the checkpoint directory, Spark Streaming works fine and I can see log files appearing in the checkpoint directory. Then I kill Spark Streaming and restart it. This time,…

Ahmed
1
vote
1 answer
How to set the number of documents processed in a batch?
With Spark 2.2.0, checkpointing works a little differently than in previous versions. A commits folder gets created, and after every batch completes a file gets written to that folder.
I am facing a scenario in which I have about 10k records…
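The commits folder described above can be inspected with a few lines of Python. This is a minimal sketch, not Spark's own API: it assumes the checkpoint directory is on a local (or mounted) path and that each completed batch leaves one file named by its integer batch id. The function name `latest_committed_batch` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_committed_batch(checkpoint_dir: str) -> Optional[int]:
    """Return the highest batch id found under <checkpoint_dir>/commits, or None."""
    commits = Path(checkpoint_dir) / "commits"
    if not commits.is_dir():
        return None
    # Assumption: one file per completed batch, named by its integer id.
    # Anything else (hidden checksum files, temp files) is ignored.
    batch_ids = [
        int(p.name)
        for p in commits.iterdir()
        if p.is_file() and p.name.isdecimal()
    ]
    return max(batch_ids, default=None)
```

Calling `latest_committed_batch("/path/to/checkpoint")` returns `None` when no batch has been committed yet, which makes it easy to tell a fresh start from a restart.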

fledgling
1
vote
1 answer
h2o checkpoint parameter change error - but no parameter changed??
I am trying to export the weights and biases of a "model" that I did not originally train with "export_weights_and_biases = TRUE".
Therefore, I'd like to checkpoint the model and set export_weights_and_biases = TRUE in a new…

ogukku
1
vote
2 answers
checkpointing DataFrames in SparkR
I am looping over a number of csv data files using R/spark. About 1% of each file must be retained (filtered based on certain criteria) and merged with the next data file (I have used union/rbind). However, as the loop runs, the lineage of the…

Ott Toomet
1
vote
1 answer
TensorFlow: restore from checkpoint to continue training
In this case, I want to continue training my model from a checkpoint. I use the cifar-10 example and made a small change in cifar-10_train.py, as shown below. They are almost the same, except that I want to restore from a checkpoint:
I replaced cifar-10 by…

mdtry
1
vote
1 answer
Spark streaming with Kafka: when recovering from checkpointing all data are processed in only one micro batch
I'm running a Spark Streaming application that reads data from Kafka.
I have activated checkpointing to recover the job in case of failure.
The problem is that if the application fails, when it restarts it tries to execute all the data from the…

Erica
1
vote
0 answers
Recovery after driver failure by exception with spark-streaming
We are currently working on a system using Kafka, Spark Streaming, and Cassandra as the DB. We are using checkpointing based on the content here [http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing]. Inside the function…

naticos
1
vote
1 answer
What file systems can be used for checkpointing
The documentation says that any Hadoop API compatible file system (like HDFS or S3) can be used as the checkpoint directory.
My question is: apart from HDFS and S3, what are other practical alternatives for a Spark Streaming application using Kafka…

Soumitra
0
votes
0 answers
How to monitor the GridDB checkpoint log file, using Zabbix
I know this is a bit specific but hopefully someone has done this before.
I am using Zabbix to monitor GridDB, using the default provided "GridDB Monitoring Template".
I am interested in knowing Block management information, and one of the…

Pratik Dwivedi
0
votes
0 answers
How to convert a checkpoint file to tensorflow.js?
I need step by step detailed instructions since I'm still a beginner.
I tried entering the following code:
import tensorflow.compat.v1 as tf
meta_path = './newcheckpoint/.meta' # Your .meta file
output_node_names = ['name_of_the_output_node'] #…
0
votes
0 answers
Flink job restarted with "org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 1 for job"
During checkpointing, if the folder where the snapshot is to be saved is already present (in my case, "chk-1"), I get the exception below, after which the job is restarted.
WARN …
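A pre-start check for leftover snapshot folders such as "chk-1" could look like the sketch below. It is not part of Flink and does not diagnose the exception. It only lists folders named `chk-<number>` under a directory, assuming that directory is on a local or mounted file system. The function name `existing_checkpoint_ids` and the directory argument are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Matches folder names such as "chk-1" and captures the numeric id.
CHK_PATTERN = re.compile(r"chk-(\d+)")


def existing_checkpoint_ids(job_checkpoint_dir: str) -> List[int]:
    """Return the sorted ids of chk-<n> folders already present in the directory."""
    root = Path(job_checkpoint_dir)
    if not root.is_dir():
        return []
    ids = []
    for entry in root.iterdir():
        match = CHK_PATTERN.fullmatch(entry.name)
        # Only directories count; a plain file named chk-2 is ignored.
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)
```

If the returned list already contains `1` before the job has taken its first checkpoint, the target folder is a leftover from an earlier run.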

Chuni Lal Kukreja
0
votes
0 answers
Expected all tensors to be on the same device, but found at least two devices
I am periodically saving checkpoints like this:
loss = trn_metrics_t[METRICS_LOSS_NDX].mean().item()
torch.save({
'epoch': epoch_ndx - 1,
'model_state_dict': self.model.state_dict(),
'optimizer_state_dict':…

Paul Reiners
0
votes
3 answers
PostgreSQL - checkpoint interval behaviour in different WAL levels
I couldn't find a definite answer to my concerns, so I might as well ask you guys!
Long story short:
We need to perform an UPDATE command on roughly 400M rows. I know the command could be modified to work in batches, but that is a different…

Bylaw
0
votes
1 answer
Append model checkpoints to existing file in PyTorch
In PyTorch, it is possible to save model checkpoints as follows:
import torch
# Create a model
model = torch.nn.Sequential(
torch.nn.Linear(1, 50),
torch.nn.Tanh(),
torch.nn.Linear(50, 1)
)
# ... some training here
# Save…

Thomas Wagenaar
0
votes
1 answer
Flink Incremental CheckPointing Compaction
We have a long-running Flink job that reads from Kafka and creates sliding time windows with stream intervals of 1 hr, 2 hr to 24 hr and slide intervals of 1 min, 10 min to 1 hr.
Basically, it's:…

Pritam Agarwala