Questions tagged [checkpointing]

105 questions
1
vote
2 answers

Spark session Null Pointer with Checkpointing

I have enabled checkpointing, which saves the logs to S3. If there are NO files in the checkpoint directory, Spark Streaming works fine and I can see log files appearing in the checkpoint directory. Then I kill Spark Streaming and restart it. This time,…
Ahmed
  • 121
  • 6
  • 18
1
vote
1 answer

How to set the number of documents processed in a batch?

With Spark 2.2.0, checkpointing works a little differently than in earlier versions. A commits folder gets created, and after every batch completes a file is written to that folder. I am facing a scenario wherein I have about 10k records…
fledgling
  • 991
  • 4
  • 25
  • 48
1
vote
1 answer

h2o checkpoint parameter change error - but no parameter changed??

I am trying to export the weights and biases of a "model" that I did not originally train with "export_weights_and_biases = TRUE". Therefore, I'd like to try to checkpoint the model and set export_weights_and_biases = TRUE in a new…
ogukku
  • 53
  • 7
1
vote
2 answers

checkpointing DataFrames in SparkR

I am looping over a number of CSV data files using R/Spark. About 1% of each file must be retained (filtered based on certain criteria) and merged with the next data file (I have used union/rbind). However, as the loop runs, the lineage of the…
Ott Toomet
  • 1,894
  • 15
  • 25
1
vote
1 answer

tensorflow : restore from checkpoint for continue training

In this case, I want to continue training my model from a checkpoint. I use the cifar-10 example and made a small change in cifar-10_train.py, shown below. The two versions are almost the same, except that I want to restore from a checkpoint: I replaced cifar-10 by…
mdtry
  • 13
  • 1
  • 5
1
vote
1 answer

Spark streaming with Kafka: when recovering from checkpointing all data are processed in only one micro batch

I'm running a Spark Streaming application that reads data from Kafka. I have activated checkpointing to recover the job in case of failure. The problem is that if the application fails, when it restarts it tries to execute all the data from the…
Erica
  • 1,608
  • 2
  • 21
  • 32
1
vote
0 answers

Recovery after driver failure by exception with spark-streaming

We are currently working on a system using Kafka, Spark Streaming, and Cassandra as the DB. We are using checkpointing based on the content here [http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing]. Inside the function…
1
vote
1 answer

What file systems can be used for checkpointing

The documentation says that any Hadoop API compatible file system (like HDFS, S3) can be used as the checkpoint directory. My question is: apart from HDFS and S3, what are other practical alternatives for a Spark Streaming application using Kafka…
Soumitra
  • 604
  • 1
  • 8
  • 20
0
votes
0 answers

How to monitor the GridDB checkpoint log file, using Zabbix

I know this is a bit specific, but hopefully someone has done this before. I am using Zabbix to monitor GridDB with the default provided "GridDB Monitoring Template". I am interested in the block management information, and one of the…
0
votes
0 answers

How to convert a checkpoint file to tensorflow.js?

I need step by step detailed instructions since I'm still a beginner. I tried entering the following code: import tensorflow.compat.v1 as tf meta_path = './newcheckpoint/.meta' # Your .meta file output_node_names = ['name_of_the_output_node'] #…
0
votes
0 answers

Flink job restarted with "org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 1 for job"

During checkpointing, the folder where the snapshot is to be saved may already be present; in my case, the target folder "chk-1" already exists. I then get the exception below, after which the job gets restarted. WARN …
0
votes
0 answers

Expected all tensors to be on the same device, but found at least two devices

I am periodically saving checkpoints like this: loss = trn_metrics_t[METRICS_LOSS_NDX].mean().item() torch.save({ 'epoch': epoch_ndx - 1, 'model_state_dict': self.model.state_dict(), 'optimizer_state_dict':…
Paul Reiners
  • 8,576
  • 33
  • 117
  • 202
0
votes
3 answers

PostgreSQL - checkpoint interval behaviour in different WAL levels

I couldn't find a definitive answer to my concerns, so I might as well ask here. Long story short: we need to perform an UPDATE command on roughly 400M rows. I know the command could be modified to work in batches, but that is a different…
Bylaw
  • 5
  • 6
0
votes
1 answer

Append model checkpoints to existing file in PyTorch

In PyTorch, it is possible to save model checkpoints as follows: import torch # Create a model model = torch.nn.Sequential( torch.nn.Linear(1, 50), torch.nn.Tanh(), torch.nn.Linear(50, 1) ) # ... some training here # Save…
Thomas Wagenaar
  • 6,489
  • 5
  • 30
  • 73
0
votes
1 answer

Flink Incremental CheckPointing Compaction

We have a continuously running Flink job that reads from Kafka and creates sliding time windows with (stream intervals: 1 hr, 2 hr to 24 hr) and (slide intervals: 1 min, 10 min to 1 hr). Basically it's:…