Questions tagged [spark-checkpoint]

51 questions
1
vote
1 answer

Using checkpointed dataframe to overwrite table fails with FileNotFoundException

I have a DataFrame df in PySpark, which results from calling: df = spark.sql("select A, B from org_table"); df = df.stuffIdo. I want to overwrite org_table at the end of my script. Since overwriting input tables is forbidden, I checkpointed my…
Markus
1
vote
0 answers

checkpoint variables used in Spark driver

I am streaming data from Kafka and also maintaining state in my application (using updateStateByKey), so I must checkpoint my data. This is working well. In addition to data from Kafka, I am also using some local variables…
0
votes
0 answers

clean checkpoint state files of spark stateful structured streaming

I am struggling to find a way to clean up the checkpoint state files, whose number grows over time after I start a stateful Spark Structured Streaming job and which end up taking a lot of disk space. By checkpoint state files I mean the delta and…
0
votes
0 answers

Using GCS bucket for checkpoints in Spark Structured Streaming

We are performing a POC to run Spark Structured Streaming on GKE (using spark-operator), and we plan to store our checkpoints in GCS. From the GCS documentation, it seems that having the storage bucket within the same location as GKE with Location…
0
votes
0 answers

Pre-populate a bronze delta table from a silver table using a batch job, then stream to it from the same table

I have a pipeline like this: kafka->bronze->silver. The bronze and silver tables are Delta tables. I'm streaming from bronze to silver using regular Spark Structured Streaming. I changed the silver schema, so I want to reload from the bronze into…
0
votes
0 answers

Disadvantages of streaming from Parquet source in Spark Structured Streaming

What are the potential disadvantages (if any) of streaming micro-batches from HDFS/S3-backed Parquet files, as opposed to standard sources like Kafka, for a long-running Spark Structured Streaming job?
0
votes
0 answers

Spark S3 Checkpointing error after enabling RocksDb

We are running a state-based Spark Streaming application on an OpenShift cluster. We are using Amazon S3 for checkpointing. RocksDB has also been enabled using the configuration "spark.sql.streaming.stateStore.providerClass" =>…
0
votes
0 answers

timeout of 60000ms expired before the position for partition could be determined

I am using Structured Streaming with subscribePattern and a checkpoint location. If I just delete a topic, the stream updates its metadata and everything looks fine. But if a topic is deleted while data is being published to it and the stream is running…
0
votes
1 answer

offset management in spark streaming

As far as I understand, for a Spark streaming application (Structured Streaming or otherwise), to manually manage the offsets, Spark provides the feature of checkpointing, where you just have to configure the checkpoint location (HDFS most of the time)…
0
votes
1 answer

How to reduce number of checkpoint files written by Spark streaming

If a Spark streaming job involves shuffle and stateful processing, it is easy to generate lots of small files per micro-batch. We should decrease the number of files without hurting latency.
0
votes
2 answers

Apache Spark Structured Streaming - not writing to checkpoint location

I have a simple Apache Spark Structured Streaming Python program that reads data from Kafka and writes the messages to the console. I've set up a checkpoint location, but the code is not writing to it. Any ideas why? Here is the code: from…
0
votes
1 answer

Spark Structured Streaming too many threads with checkpointing on S3

Spark 3.0.1, hadoop-aws 3.2.0. I have a simple Spark streaming application that reads messages from a Kafka topic, aggregates them, and writes them to Elasticsearch. I am using checkpointing with an S3 bucket to store the checkpoints. After some time application…
0
votes
1 answer

Sliding Window without watermark in Apache Spark?

Consider a simple aggregation with a window defined without any watermark, say: df .groupBy(window(col("time"), "30 minutes","10 minutes").as("time")) .aggr .... Here our window is 30 minutes, with a sliding interval of 10 minutes. Q1.…
0
votes
0 answers

Structured Streaming in Databricks Azure throwing exception - java.lang.IllegalStateException: Error reading delta file dbfs:/raw_zone/1.delta

We are using Structured Streaming in a Databricks environment. Every time we run this program - Kafka - Structured Streaming (DBR6.6, Spark 2.4.5) - writing to CosmosDB, we get the same exception as below just before we do the final…
0
votes
1 answer

Spark Structured Streaming using spark-acid writeStream (with checkpoint) throwing org.apache.hadoop.fs.FileAlreadyExistsException

In our Spark app, we use Spark Structured Streaming. It uses Kafka as the input stream and HiveAcid as the writeStream to a Hive table. For HiveAcid, it is an open-source library called spark acid from Qubole: https://github.com/qubole/spark-acid Below is our…