Highest Voted 'checkpointing' Questions

1

vote

2 answers

Spark session Null Pointer with Checkpointing

I have enabled checkpointing that saves the logs to S3. If there are NO files in the checkpoint directory, Spark Streaming works fine and I can see log files appearing in the checkpoint directory. Then I kill Spark Streaming and restart it. This time,…

asked Sep 13 '17 at 00:47

Ahmed

121
6
18

1

vote

1 answer

How to set the number of documents processed in a batch?

With Spark 2.2.0, checkpointing works a little differently than in previous versions. A commits folder gets created, and after every batch completes a file gets written to that folder. I am facing a scenario wherein I have about 10k records…

apache-spark spark-structured-streaming checkpointing

asked Jul 20 '17 at 19:29

fledgling

991
4
25
48

1

vote

1 answer

h2o checkpoint parameter change error - but no parameter changed??

I am trying to export the weights and biases of a model that I did not originally train with "export_weights_and_biases = TRUE". Therefore, I'd like to checkpoint the model and try to set export_weights_and_biases = TRUE in a new…

r h2o checkpointing

asked Jun 03 '17 at 07:33

ogukku

53
7

1

vote

2 answers

checkpointing DataFrames in SparkR

I am looping over a number of csv data files using R/spark. About 1% of each file must be retained (filtered based on certain criteria) and merged with the next data file (I have used union/rbind). However, as the loop runs, the lineage of the…

r apache-spark checkpointing

asked Mar 14 '17 at 22:21

Ott Toomet

1,894
15
25

1

vote

1 answer

TensorFlow: restore from checkpoint to continue training

In this case, I want to continue training my model from a checkpoint. I use the cifar-10 example and made a small change in cifar-10_train.py like below. They are almost the same, except I want to restore from a checkpoint: I replaced cifar-10 by…

tensorflow restore checkpointing

asked Sep 02 '16 at 13:55

mdtry

13
1
5

1

vote

1 answer

Spark Streaming with Kafka: when recovering from checkpointing, all data are processed in only one micro-batch

I'm running a Spark Streaming application that reads data from Kafka. I have activated checkpointing to recover the job in case of failure. The problem is that if the application fails, when it restarts it tries to execute all the data from the…

apache-spark spark-streaming checkpointing

asked Jun 22 '16 at 10:56

Erica

1,608
2
21
32

1

vote

0 answers

Recovery after driver failure by exception with spark-streaming

We are currently working on a system using Kafka, Spark Streaming, and Cassandra as the DB. We are using checkpointing based on the content here [http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing]. Inside the function…

apache-spark cassandra spark-streaming spark-cassandra-connector checkpointing

asked Mar 16 '16 at 00:34

naticos

11
2

1

vote

1 answer

What file systems can be used for checkpointing

The documentation says that any Hadoop API compatible file system (like HDFS or S3) can be used as a checkpoint directory. My question is: apart from HDFS and S3, what are other practical alternatives for a Spark Streaming application using Kafka…

apache-spark hdfs spark-streaming checkpointing

asked Jan 07 '16 at 00:03

Soumitra

604
1
8
20

0

votes

0 answers

How to monitor the GridDB checkpoint log file, using Zabbix

I know this is a bit specific, but hopefully someone has done this before. I am using Zabbix to monitor GridDB, using the default provided "GridDB Monitoring Template". I am interested in knowing block management information, and one of the…

zabbix database-partitioning griddb checkpointing

asked Aug 25 '23 at 17:39

Pratik Dwivedi

53
5

0

votes

0 answers

How to convert a checkpoint file to tensorflow.js?

I need step by step detailed instructions since I'm still a beginner. I tried entering the following code: import tensorflow.compat.v1 as tf meta_path = './newcheckpoint/.meta' # Your .meta file output_node_names = ['name_of_the_output_node'] #…

tensorflow machine-learning type-conversion tensorflow2.0 checkpointing

asked Jul 29 '23 at 12:32

Christopher Koh

1

0

votes

0 answers

Flink job restarted with "org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 1 for job"

During checkpointing, if the folder where the snapshot is to be saved is already present (in my case, "chk-1"), I get the exception below, after which the job gets restarted. WARN …

apache-flink flink-streaming snapshot checkpointing savepoints

asked Apr 12 '23 at 14:47

Chuni Lal Kukreja

11
3

0

votes

0 answers

Expected all tensors to be on the same device, but found at least two devices

I am periodically saving checkpoints like this: loss = trn_metrics_t[METRICS_LOSS_NDX].mean().item() torch.save({ 'epoch': epoch_ndx - 1, 'model_state_dict': self.model.state_dict(), 'optimizer_state_dict':…

torch checkpointing

asked Mar 26 '23 at 14:28

Paul Reiners

8,576
33
117
202

0

votes

3 answers

PostgreSQL - checkpoint interval behaviour in different WAL levels

I couldn't find a definitive answer to my concerns, so I might as well ask you guys! Long story short: we need to perform an UPDATE command on roughly 400M rows. I know the command could be modified to work in batches, but that is a different…

postgresql database-replication wal checkpointing

asked Mar 07 '23 at 12:37

Bylaw

5
6

0

votes

1 answer

Append model checkpoints to existing file in PyTorch

In PyTorch, it is possible to save model checkpoints as follows: import torch # Create a model model = torch.nn.Sequential( torch.nn.Linear(1, 50), torch.nn.Tanh(), torch.nn.Linear(50, 1) ) # ... some training here # Save…

python file pytorch save checkpointing

asked Feb 21 '23 at 12:23

Thomas Wagenaar

6,489
5
30
73

0

votes

1 answer

Flink Incremental Checkpointing Compaction

We have a forever-running Flink job which reads from Kafka and creates sliding time windows with (stream intervals: 1 hr, 2 hr to 24 hr) and (slide intervals: 1 min, 10 min to 1 hour). Basically it's:…

streaming apache-flink flink-sql checkpointing flink-checkpoint

asked Nov 07 '22 at 15:28

Pritam Agarwala

1
2
