Questions tagged [spark-checkpoint]
51 questions
1 vote, 1 answer
Using checkpointed dataframe to overwrite table fails with FileNotFoundException
I have a DataFrame df in PySpark, which results from calling:
df = spark.sql("select A, B from org_table")
df = df.stuffIdo
I want to overwrite org_table at the end of my script.
Since overwriting input tables is forbidden, I checkpointed my…

Markus
1 vote, 0 answers
Checkpoint variables used in Spark driver
I am streaming data from Kafka and also maintaining state in my application (using updateStateByKey), so I necessarily need to checkpoint my data. This is working well.
In addition to data from Kafka, I am also using some local variables…

Amanpreet Khurana
0 votes, 0 answers
Clean checkpoint state files of Spark stateful Structured Streaming
I am struggling to find a way to clean up the checkpoint state files, whose number grows over time after I start a stateful Spark Structured Streaming job and which end up taking a lot of disk space. By checkpoint state files I mean the delta and…

lize su
0 votes, 0 answers
Using GCS bucket for checkpoints in Spark Structured Streaming
We are performing a POC to run a Spark Structured Streaming job on GKE (using spark-operator), and we plan to store our checkpoints in GCS.
From the GCS documentation, it seems that having the storage bucket within the same location as GKE with Location…

it243
0 votes, 0 answers
Pre-populate a bronze delta table from a silver table using a batch job, then stream to it from the same table
I have a pipeline like this:
kafka->bronze->silver
The bronze and silver tables are Delta tables. I'm streaming from bronze to silver using regular Spark Structured Streaming.
I changed the silver schema, so I want to reload from the bronze into…

user961826
0 votes, 0 answers
Disadvantages of streaming from Parquet source in Spark Structured Streaming
What are the potential disadvantages (if any) of streaming micro-batches from HDFS/S3-backed Parquet files, as opposed to standard sources like Kafka, for a long-running Spark Structured Streaming job?

John Subas
0 votes, 0 answers
Spark S3 checkpointing error after enabling RocksDB
We are running a stateful Spark Streaming application on an OpenShift cluster. We are using Amazon S3 for checkpointing. RocksDB has also been enabled using the configuration "spark.sql.streaming.stateStore.providerClass" =>…
0 votes, 0 answers
timeout of 60000ms expired before the position for partition could be determined
I am using Structured Streaming with subscribePattern and a checkpoint location.
If I just delete a topic, the stream updates its metadata and everything looks fine. But if a topic is deleted while data is being published to it and the stream is running…

notesdvi
0 votes, 1 answer
Offset management in Spark Streaming
As far as I understand, for a Spark Streaming application (Structured Streaming or otherwise), to manually manage the offsets, Spark provides the feature of checkpointing, where you just have to configure the checkpoint location (HDFS most of the time)…

Gaurav Gupta
0 votes, 1 answer
How to reduce the number of checkpoint files written by Spark Streaming
If a Spark Streaming job involves shuffle and stateful processing, it can easily generate lots of small files per micro-batch. We should decrease the number of files without hurting latency.

Warren Zhu
0 votes, 2 answers
Apache Spark Structured Streaming - not writing to checkpoint location
I have a simple Apache Spark Structured Streaming Python program that reads data from Kafka and writes the messages to the console.
I've set up a checkpoint location, but the code is not writing to it.
Any ideas why?
Here is the code:
from…

Karan Alang
0 votes, 1 answer
Spark Structured Streaming: too many threads with checkpointing on S3
Spark 3.0.1
hadoop-aws 3.2.0
I have a simple Spark Streaming application that reads messages from a Kafka topic, aggregates them, and writes them into Elasticsearch. I am using checkpointing, with an S3 bucket to store the checkpoints.
After some time the application…
0 votes, 1 answer
Sliding Window without watermark in Apache Spark?
Say I have a simple aggregation with a window defined without any watermark:
df
.groupBy(window(col("time"), "30 minutes","10 minutes").as("time"))
.aggr ....
Here our window is 30 minutes, with a sliding interval of 10 minutes.
Q1.…

supernatural
0 votes, 0 answers
Structured Streaming in Databricks Azure throwing exception - java.lang.IllegalStateException: Error reading delta file dbfs:/raw_zone/1.delta
We are using Structured Streaming in a Databricks environment. Every time we run this program (Kafka, Structured Streaming on DBR 6.6 / Spark 2.4.5, writing to Cosmos DB), we get the same exception as below just before we do the final…

Vishnu
0 votes, 1 answer
Spark Structured Streaming using spark-acid writeStream (with checkpoint) throwing org.apache.hadoop.fs.FileAlreadyExistsException
In our Spark app, we use Spark Structured Streaming. It uses Kafka as the input stream and HiveAcid as the writeStream to a Hive table.
HiveAcid comes from an open-source library called spark-acid, from Qubole: https://github.com/qubole/spark-acid
Below is our…

Shuwn Yuan Tee