Questions tagged [spark-checkpoint]
51 questions
1
vote
1 answer
Change spark.dynamicAllocation.cachedExecutorIdleTimeout after rdd checkpoint?
A Spark job runs expensive computations in its first stage, and I checkpoint the resulting RDD so that they don't have to be repeated if executors are preempted (it runs on YARN with preemption). The job also uses a high timeout value for…

Uwe Brandt
- 341
- 2
- 8
1
vote
0 answers
Spark Structured Streaming checkpoint vs SparkContext checkpoint dir
Hello Stack Overflow community.
I'm running a Spark Streaming app in a production environment, and we noticed that Spark checkpoints contribute greatly to under-replication in HDFS and thus affect HDFS stability. I'm trying to…

MsCurious
- 175
- 1
- 12
1
vote
2 answers
Specifying checkpoint location when structured streaming the data from kafka topics
I have built a Spark Structured Streaming application that reads data from Kafka topics, with the starting offsets set to latest. If there is a failure on the Spark side, from which point/offset will the data continue to…

swetha k
- 11
- 2
1
vote
0 answers
(py)Spark checkpointing consumes driver memory
Context
I have a PySpark query that creates a rather large DAG. To shrink it, I break the lineage using checkpoint(eager=True), which normally works.
Note: I do not use localCheckpoint() since I use dynamic resource allocation (see the docs for…

Markus
- 2,265
- 5
- 28
- 54
1
vote
1 answer
Spark Structured Streaming - checkpoint metadata growing indefinitely
I use Spark Structured Streaming 3.1.2. I need to use S3 for storing checkpoint metadata (I know it's not optimal storage for checkpoint metadata). The compaction interval is 10 (the default) and I set spark.sql.streaming.minBatchesToRetain=5. When the job…

wind
- 892
- 1
- 11
- 27
1
vote
1 answer
delta mergeSchema doesn't work using MemoryStream with spark checkpoint
I am testing a DeltaWriter class, using Spark's MemoryStream to create a stream (rather than readStream), and I want to write the result to S3 as a Delta file with the option "mergeSchema": true, as shown below:
import…

b-j
- 11
- 2
1
vote
0 answers
Spark structured streaming - reading from last read processed message after service restart
I am currently reading from a Kafka topic, processing the messages, and writing them to another topic. This processing and producing logic is inside the test_saprk function. A code sample can be found below:
df_file = (
…

J.Doe
- 529
- 4
- 14
1
vote
0 answers
OOM and data loss issues using checkpoints with spark streaming (pyspark) on Databricks
I have encountered many issues using checkpoints with Spark Streaming on Databricks. The code below led to OOM errors on our clusters. Investigating the cluster's memory usage, we could see that memory was slowly increasing over time, indicating…

Noé Achache
- 195
- 2
- 9
1
vote
1 answer
Why is checkpoint() faster than persist()?
I have code that performs calculations with a DataFrame.
+------------------------------------+------------+----------+----+------+
| Name| …

Mardaunt
- 82
- 1
- 13
1
vote
1 answer
PySpark - Read checkpointed DataFrame
I am currently using PySpark to perform some data cleaning for a machine learning application.
The last session crashed, but I had set up a checkpoint dir and checkpointed my DataFrame.
Now I have a checkpointed data directory in the form…

Joschua Xner
- 95
- 1
- 10
1
vote
0 answers
How can I load a checkpointed PySpark DataFrame?
My code below crashed, and instead of restarting from the beginning, I would like to start from the last checkpointed DataFrame. How can I load it? I have this folder in my directory…

Florian
- 194
- 2
- 17
1
vote
1 answer
Spark streaming checkpointing issue with Azure blob storage: Error in TaskCompletionListener null
I am using the checkpoint functionality of Spark Structured Streaming, with Azure Blob storage for the checkpoint metadata.
But I am getting the error below. From the logs, it seems to be deleting a temp file and then trying to access it again.
Below is the detail…

Debug Logs
- 61
- 6
1
vote
1 answer
How does Spark calculate the window start time for a given window interval?
Suppose I have an input df with a timestamp column, and I set the window duration (with no sliding interval) to:
10 minutes
with an input time of (2019-02-28 22:33:02)
the window formed is (2019-02-28 22:30:02) to (2019-02-28 22:40:02)
8…

supernatural
- 1,107
- 11
- 34
1
vote
1 answer
How to clean up the checkpoint files accumulated in Spark Structured Streaming?
I set a checkpoint directory on the SparkContext and wrote a query for Kafka data streaming in a long-running Spark Structured Streaming job.
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")
...
val monitoring_stream = monitoring_df.writeStream
…

yyuankm
- 295
- 4
- 22
1
vote
1 answer
Spark checkpoint: error java.io.FileNotFoundException
I have a pipeline in which I apply several transformations to my DataFrame.
Inserting checkpoints is important to keep the execution time acceptable.
However, from time to time I get this error from any of the checkpoints:
Job aborted due to…

drlol
- 333
- 4
- 18