Questions tagged [checkpointing]

105 questions
2
votes
1 answer

Checkpoint in Declarative Jenkins Pipeline

I am looking at the CloudBees documentation, which says: "The correct approach is to always keep the checkpoint step outside of any node block, not associated with either an agent or a workspace." The sample given is for a scripted pipeline. I…
Ram
2
votes
2 answers

Where is the default checkpoint(s) kept in Apache Flink?

I am a newbie to Apache Flink, and I was going through Apache Flink's examples. I found that in case of a failure, Flink has the ability to restore stream processing from a checkpoint. StreamExecutionEnvironment env =…
himanshuIIITian
2
votes
0 answers

Reconnecting to MPI after Linux process state is restored

Storytime! Consider the following scenario: Using Hydra, MPICH spawns 2 different processes (Simulators). Call them Apple and Orange! Apple and Orange start, they load a dynamically linked library, and they use that library to call MPI_Init and do…
MehMastah
2
votes
0 answers

TensorFlow Checkpoints for Online Learning

I am trying to build an adaptable speech recognition system based on Mozilla DeepSpeech (which is a TensorFlow implementation of the DeepSpeech paper). The idea is that we will pretrain a model on a certain voice. Then, save the model + create a…
2
votes
1 answer

S3 Checkpoint with Structured Streaming

I have tried the suggestions given in "Apache Spark (Structured Streaming): S3 Checkpoint support", but I am still facing this issue. Below is the error I get: 17/07/06 17:04:56 WARN FileSystem: "s3n" is a deprecated filesystem name. Use…
2
votes
0 answers

Failure to reload from checkpoint directory

When I tried reloading my Spark Streaming application from a checkpoint directory, I got the following exception: java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist:…
2
votes
0 answers

Read Spark Streaming checkpoint data

I'm writing a Spark Streaming application reading from Kafka. In order to have exactly-once semantics, I'd like to use the direct Kafka stream and Spark Streaming's native checkpointing. The problem is that checkpointing makes practically…
mgaido
2
votes
1 answer

Variable scopes in Tensorflow

I am having problems making effective use of variable scopes. I want to define some variables for the weights, biases and inner state of a simple recurrent network. I call get_saver() once after defining the default graph. I then iterate over a batch…
diffeomorphism
2
votes
1 answer

What does checkpointing do on Apache Spark?

What does checkpointing do for Apache Spark, and does it incur any cost in RAM or CPU?
cshin9
1
vote
0 answers

SSIS checkpoints are not re-starting correctly, skipping NON-checkpointed tasks

I have an SSIS package where the checkpoints are not behaving as I understand they should. To simplify, this is the kind of setup: Imagine a package with two containers in a serial flow (Container 1 executes, then Container 2). Checkpoints are…
Lee Cascio
1
vote
1 answer

How to configure checkpointing on an XTDB node using AWS S3

I am using XTDB 1.21.0 deployed on AWS/ECS (Fargate) with checkpoints configured (frequency 30 minutes) and stored on an S3 bucket (RocksDB). After a couple of successful checkpoints, they seem to be constantly failing with an XTDB warning due to an…
modality
1
vote
0 answers

Flink AT_LEAST_ONCE checkpoint uses 100% managed memory

We have a Flink streaming job (v1.14) running in native K8S deployment mode. When we use AT_LEAST_ONCE checkpoint mode, the managed memory usage hits 100% no matter how much memory we assign to it. Any ideas what might be the cause or is this…
周天钜
1
vote
1 answer

Flink checkpointing working for ProcessFunction but not for AsyncFunction

I have operator checkpointing enabled and working smoothly for a ProcessFunction operator. On job failure I can see how operator state gets externalized on the snapshotState() hook, and on resume, I can see how state is restored at the…
1
vote
1 answer

Apache Flink to use S3 for backend state and checkpoints

Background: I was planning to use S3 to store Flink's checkpoints using the FsStateBackend, but I was getting the following error. Error: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system…
1
vote
0 answers

Apache Flink losing records when task manager is restarted

I am using a Flink cluster with a job manager pod and two task manager pods in a Kubernetes cluster. When I submit the streaming job to the job manager, it runs the job and I receive the output in the sink. I have also enabled checkpointing to…
user3553913