Highest Voted 'spark-checkpoint' Questions

1

vote

1 answer

Using checkpointed dataframe to overwrite table fails with FileNotFoundException

I have a dataframe df in PySpark, which results from calling: df = spark.sql("select A, B from org_table"); df = df.stuffIdo. I want to overwrite org_table at the end of my script. Since overwriting input tables is forbidden, I checkpointed my…

asked Jun 27 '19 at 11:32

Markus

2,265
5
28
54

1

vote

0 answers

checkpoint variables used in Spark driver

I am streaming data from Kafka and also maintaining state in my application (using updateStateByKey), so I must checkpoint my data. This is working well. In addition to data from Kafka, I am also using some local variables…

apache-spark apache-kafka spark-streaming spark-checkpoint

asked Feb 09 '18 at 15:24

Amanpreet Khurana

549
1
5
17

0

votes

0 answers

clean checkpoint state files of spark stateful structured streaming

I am struggling to find a way to clean up the checkpoint state files, whose number grows over time after I start a stateful Spark Structured Streaming job and which end up taking a lot of disk space. By checkpoint state files I mean the delta and…

apache-spark spark-structured-streaming spark-checkpoint

asked Jul 03 '23 at 11:41

lize su

11
5

0

votes

0 answers

Using GCS bucket for checkpoints in Spark Structured Streaming

We are performing a POC to run Spark Structured Streaming on GKE (using spark-operator), and we plan to store our checkpoints in GCS. From the GCS documentation, it seems that having the storage bucket within the same location as GKE with Location…

apache-spark google-kubernetes-engine spark-structured-streaming gcs spark-checkpoint

asked Mar 10 '23 at 18:45

it243

71
7

0

votes

0 answers

Pre-populate a bronze delta table from a silver table using a batch job, then stream to it from the same table

I have a pipeline like this: Kafka -> bronze -> silver. The bronze and silver tables are Delta tables. I'm streaming from bronze to silver using regular Spark Structured Streaming. I changed the silver schema, so I want to reload from the bronze into…

apache-spark spark-structured-streaming delta-lake spark-checkpoint

asked Feb 15 '23 at 01:57

user961826

564
6
14

0

votes

0 answers

Disadvantages of streaming from Parquet source in Spark Structured Streaming

What are the potential disadvantages (if any) of streaming micro-batches from HDFS/S3-backed Parquet files, as opposed to standard sources like Kafka, for a long-running Spark Structured Streaming job?

apache-spark hdfs parquet spark-structured-streaming spark-checkpoint

asked Oct 28 '22 at 15:32

John Subas

81
1
11

0

votes

0 answers

Spark S3 checkpointing error after enabling RocksDB

We are running a stateful Spark Streaming application on an OpenShift cluster. We are using Amazon S3 for checkpointing. RocksDB has also been enabled using the configuration "spark.sql.streaming.stateStore.providerClass" =>…

amazon-s3 spark-streaming rocksdb spark-checkpoint

asked Oct 03 '22 at 14:22

Varun Arora

1

0

votes

0 answers

timeout of 60000ms expired before the position for partition could be determined

I am using Structured Streaming with subscribePattern and a checkpoint location. If I just delete a topic, the stream updates its metadata and everything looks fine. But if a topic is deleted while data is being published to it and the stream is running…

apache-spark spark-streaming apache-kafka-streams spark-structured-streaming spark-checkpoint

asked Sep 21 '22 at 13:57

notesdvi

1
1

0

votes

1 answer

offset management in spark streaming

As far as I understand, for a Spark streaming application (Structured Streaming or otherwise), to manually manage the offsets, Spark provides the feature of checkpointing, where you just have to configure the checkpoint location (HDFS most of the time)…

apache-spark apache-kafka spark-streaming spark-streaming-kafka spark-checkpoint

asked May 15 '22 at 19:43

Gaurav Gupta

159
1
17

0

votes

1 answer

How to reduce the number of checkpoint files written by Spark streaming

If a Spark streaming job involves shuffle and stateful processing, it can easily generate lots of small files per micro-batch. We would like to decrease the number of files without hurting latency.

apache-spark spark-structured-streaming spark-checkpoint

asked Feb 08 '22 at 01:06

Warren Zhu

1,355
11
12

0

votes

2 answers

Apache Spark Structured Streaming - not writing to checkpoint location

I have a simple Apache Spark Structured Streaming Python program that reads data from Kafka and writes the messages to the console. I've set up a checkpoint location, but the code is not writing to the checkpoint. Any ideas why? Here is the code: from…

apache-spark apache-kafka spark-structured-streaming spark-checkpoint

asked Dec 09 '21 at 20:18

Karan Alang

869
2
10
35

0

votes

1 answer

Spark Structured Streaming: too many threads with checkpointing on S3

Spark 3.0.1, hadoop-aws 3.2.0. I have a simple Spark streaming application that reads messages from a Kafka topic, aggregates them, and writes them to Elasticsearch. I am using checkpointing with an S3 bucket to store the checkpoints. After some time the application…

apache-spark amazon-s3 spark-structured-streaming spark-checkpoint

asked Mar 04 '21 at 08:40

Andrii Pohrebniak

1
1

0

votes

1 answer

Sliding Window without watermark in Apache Spark?

Consider a simple aggregation with a window defined without any watermark, say: df.groupBy(window(col("time"), "30 minutes", "10 minutes").as("time")) .aggr .... Here our window is 30 minutes, with a sliding interval of 10 minutes. Q1.…

scala apache-spark spark-structured-streaming spark-streaming-kafka spark-checkpoint

asked Jan 26 '21 at 17:30

supernatural

1,107
11
34

0

votes

0 answers

Structured Streaming in Databricks Azure throwing exception - java.lang.IllegalStateException: Error reading delta file dbfs:/raw_zone/1.delta

We are using Structured Streaming in a Databricks environment. Every time we run this program (Kafka, Structured Streaming on DBR 6.6 with Spark 2.4.5, writing to Cosmos DB), we get the same exception as below just before we do the final…

apache-spark-sql spark-streaming spark-structured-streaming azure-databricks spark-checkpoint

asked Nov 24 '20 at 09:27

Vishnu

41
3

0

votes

1 answer

Spark Structured Streaming using spark-acid writeStream (with checkpoint) throwing org.apache.hadoop.fs.FileAlreadyExistsException

In our Spark app, we use Spark Structured Streaming. It uses Kafka as the input stream and HiveAcid as the writeStream sink to a Hive table. HiveAcid comes from an open-source library called spark-acid from Qubole: https://github.com/qubole/spark-acid Below is our…

apache-spark spark-structured-streaming qubole spark-hive spark-checkpoint

asked May 22 '20 at 06:56

Shuwn Yuan Tee

5,578
6
28
42
