Questions tagged [spark-checkpoint]

51 questions
1 vote
1 answer

Change spark.dynamicAllocation.cachedExecutorIdleTimeout after RDD checkpoint?

A Spark job runs expensive computations in the first stage, and I checkpoint the resulting RDD so that they don't have to be repeated if executors are preempted (it runs on YARN with preemption). The job also uses a high timeout value for…
Uwe Brandt
1 vote
0 answers

Spark Structured Streaming checkpoint vs Spark context CheckPointDir

Hello Stack Overflow community. I'm running a Spark streaming app in a production environment, and we noticed that Spark checkpoints contribute greatly to under-replication in HDFS and thus affect HDFS stability. I'm trying to…
MsCurious
1 vote
2 answers

Specifying the checkpoint location when streaming data from Kafka topics with Structured Streaming

I have built a Spark Structured Streaming application that reads data from Kafka topics, and I have specified the starting offsets as latest. If there is a failure on the Spark side, from which point/offset will the data continue to…
1 vote
0 answers

(py)Spark checkpointing consumes driver memory

Context: I have a PySpark query that creates a rather large DAG. I therefore break the lineage using checkpoint(eager=True) to shrink it, which normally works. Note: I do not use localCheckpoint() since I use dynamic resource allocation (see the docs for…
Markus
1 vote
1 answer

Spark Structured Streaming: checkpoint metadata growing indefinitely

I use Spark Structured Streaming 3.1.2. I need to use S3 for storing checkpoint metadata (I know it's not optimal storage for checkpoint metadata). The compaction interval is 10 (the default) and I set spark.sql.streaming.minBatchesToRetain=5. When the job…
1 vote
1 answer

Delta mergeSchema doesn't work using MemoryStream with Spark checkpoint

I am testing a DeltaWriter class, using Spark's MemoryStream to create a stream (rather than readStream), and I want to write the result to S3 as a Delta file with the option "mergeSchema": true, as shown below: import…
1 vote
0 answers

Spark Structured Streaming - resuming from the last processed message after service restart

I am currently reading from a Kafka topic, processing the messages, and writing them to another topic. This processing and producing logic is inside the test_saprk function. A code sample can be found below: df_file = ( …
1 vote
0 answers

OOM and data loss issues using checkpoints with Spark Streaming (PySpark) on Databricks

I have encountered many issues using checkpoints with Spark Streaming on Databricks. The code below led to OOM errors on our clusters. Investigating the cluster's memory usage, we could see that memory was slowly increasing over time, indicating…
1 vote
1 answer

Why is checkpoint() faster than persist()?

I have code that performs calculations with a DataFrame. +------------------------------------+------------+----------+----+------+ | Name| …
Mardaunt
1 vote
1 answer

PySpark - Read checkpointed DataFrame

I am currently using PySpark to perform some data cleaning for a machine learning application. The last session crashed, but I had set up a checkpoint directory and checkpointed my DataFrame. Now I have a checkpointed data directory in the form…
Joschua Xner
1 vote
0 answers

How can I load a checkpointed PySpark DataFrame?

My code below crashed, and instead of restarting from the beginning, I would like to resume from the last checkpointed DataFrame. How can I load it? I have this folder in my directory…
Florian
1 vote
1 answer

Spark streaming checkpointing issue with Azure Blob Storage: Error in TaskCompletionListener null

I am using the checkpoint functionality of Spark Structured Streaming with Azure Blob Storage as the storage for checkpoint metadata, but I am getting the error below. From the logs, it seems it is deleting a temp file and then trying to access it again. Below is the detail…
1 vote
1 answer

How does Spark calculate the window start time for a given window interval?

Suppose I have an input DataFrame with a timestamp column. When I set the window duration (with no sliding interval) to 10 minutes, an input time of (2019-02-28 22:33:02) produces the window (2019-02-28 22:30:02) to (2019-02-28 22:40:02) 8…
1 vote
1 answer

How to clean up the checkpoint files accumulated in Spark Structured Streaming?

I set a checkpoint directory on the SparkContext and wrote a query for Kafka data streaming in a long-running Spark Structured Streaming job. spark.sparkContext.setCheckpointDir("/tmp/checkpoint") ... val monitoring_stream = monitoring_df.writeStream …
1 vote
1 answer

Spark checkpoint: java.io.FileNotFoundException error

I have a pipeline in which I apply several transformations to my DataFrame. It is important to insert checkpoints to ensure an acceptable execution time. However, from time to time I get this error from one of the checkpoints: Job aborted due to…
drlol