Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and should not be confused with Spark Streaming (DStreams), the older API from Spark 1.x.


2360 questions
0
votes
2 answers

Unable to use kafka jars on Jupyter notebook

I'm using Spark Structured Streaming to read data from a single-node Kafka instance, running the setup below locally on a Mac. I can read via spark-submit, but it does not work in a Jupyter notebook. from pyspark.sql import SparkSession from pyspark.sql.functions import…
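
A minimal sketch of the usual fix, assuming the notebook builds its own SparkSession: the Kafka connector must be declared via spark.jars.packages before the session is created (spark-submit does this via --packages, which is why it works there). The connector coordinates below assume Spark 3.5.0 with Scala 2.12; adjust to your installation, and the topic name is a placeholder.

    from pyspark.sql import SparkSession

    # Pull the Kafka connector from Maven at session startup; the version
    # must match the local Spark/Scala build (an assumption here).
    spark = (
        SparkSession.builder
        .appName("kafka-notebook")
        .config("spark.jars.packages",
                "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
        .getOrCreate()
    )

    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "my_topic")  # hypothetical topic name
        .load()
    )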
0
votes
1 answer

Spark continuous structured streaming not showing input rate or process rate metrics

I'm running my Spark continuous structured streaming application on a standalone cluster. However, I noticed that metrics like avg input/sec and avg process/sec are not showing (they appear as NaN) on the Structured Streaming UI. I have…
0
votes
0 answers

How to get notified when Spark streaming starts to process a file?

I use Spark to read streaming data from a folder, and I want to be notified when it picks up my file and starts processing it, so that I can log that the application is reading the file, which may take some time. I've…
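
One way to get such a notification, sketched under the assumption of PySpark 3.4+ and an existing SparkSession named spark, is a StreamingQueryListener: progress events fire per micro-batch, and numInputRows > 0 indicates the batch actually picked up input (for the file source, new files).

    from pyspark.sql.streaming import StreamingQueryListener

    class FilePickupListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            print(f"query started: {event.id}")

        def onQueryProgress(self, event):
            # numInputRows > 0 means the last micro-batch read something
            if event.progress.numInputRows > 0:
                print(f"reading input, batch {event.progress.batchId}")

        def onQueryTerminated(self, event):
            print(f"query terminated: {event.id}")

    spark.streams.addListener(FilePickupListener())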
0
votes
0 answers

Handling changes in spark streaming pipelines

What is the common/suggested practice for re-ingestion in a Spark Structured Streaming pipeline? For example, after a bug in the consumer streaming code that reads from a queue. In such cases we, as the consumer reading from the queue, would need to…
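
One common recovery pattern, sketched here for a Kafka-backed queue with placeholder topic, offsets, and paths: fix the consumer code, point the query at a new checkpoint location, and replay from explicit offsets. Note that startingOffsets is honored only when no checkpoint exists, and the sink must tolerate the resulting duplicates (e.g. via idempotent or merge writes).

    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        # hypothetical partition offsets to replay from
        .option("startingOffsets", '{"events": {"0": 120000, "1": 118500}}')
        .load()
    )

    (df.writeStream
        .format("delta")
        # a fresh checkpoint, so startingOffsets takes effect
        .option("checkpointLocation", "/chk/events_v2")
        .start("/tables/events"))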
0
votes
0 answers

spark local mode and standalone mode on single server/machine

I have a single server with 24 cores and 260 GB of memory. For speed, is it better to run Spark in local mode or in standalone cluster mode here? And what if, on the same server, I create 4 different containers and assign each 6 cores and 65 GB…
0
votes
1 answer

Infer Schema Fails in Databricks Notebook

I have written a Spark structured stream in Databricks. The first bit of code checks whether a Delta table exists for my entity; if it does not, the Delta table is created. Here, I wanted to use the infer-schema option to get the schema for the…
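
A minimal sketch of the existence check, assuming the delta-spark package and a hypothetical path and schema; note that for file-based streaming sources, schema inference additionally requires spark.sql.streaming.schemaInference=true or an explicit schema on readStream.

    from delta.tables import DeltaTable

    path = "/mnt/lake/my_entity"  # hypothetical location
    if not DeltaTable.isDeltaTable(spark, path):
        # create an empty Delta table with a placeholder schema
        spark.createDataFrame([], "id STRING, value DOUBLE") \
            .write.format("delta").save(path)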
0
votes
0 answers

Better ways to handle data corrections in Spark streaming

The Spark Structured Streaming code needs to read data from Kafka, perform a deduplication check on a key, and write to a Delta target. For dedup I am planning to use a watermark, as mentioned here:…
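
A sketch of that Kafka-to-Delta shape with watermarked dedup; the key and column names are placeholders. Including the event-time column in dropDuplicates lets the watermark bound the dedup state.

    from pyspark.sql import functions as F

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
        .select(F.col("key").cast("string").alias("event_key"),
                F.col("timestamp").alias("event_time"),
                F.col("value").cast("string").alias("value"))
    )

    deduped = (
        events
        .withWatermark("event_time", "1 hour")        # state retention bound
        .dropDuplicates(["event_key", "event_time"])  # dedup within watermark
    )

    (deduped.writeStream
        .format("delta")
        .option("checkpointLocation", "/chk/dedup")
        .start("/tables/events_deduped"))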
0
votes
0 answers

How to split a stream in structured streaming without incurring a dual read from kafka

Our Spark streaming app reads different types of events from a single global Kafka topic and needs to join two types of event streams. In Structured Streaming, we are noticing that splitting the input stream based on a filter condition…
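
One way to avoid the second Kafka scan is to read the topic once and fan out inside foreachBatch; persisting the micro-batch keeps the per-type filters (or a batch-level join) from recomputing the source read. Event-type values, paths, and the stream_df name below are placeholders.

    def fan_out(batch_df, batch_id):
        batch_df.persist()  # one Kafka read, reused by both branches
        try:
            batch_df.filter("event_type = 'A'").write \
                .format("delta").mode("append").save("/tables/type_a")
            batch_df.filter("event_type = 'B'").write \
                .format("delta").mode("append").save("/tables/type_b")
        finally:
            batch_df.unpersist()

    (stream_df.writeStream
        .foreachBatch(fan_out)
        .option("checkpointLocation", "/chk/fanout")
        .start())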
0
votes
0 answers

Spark structured streaming performance regression in latency times reading/writing to kafka since 3.0.2

During a migration from Spark 2.4.4 to Spark 3.4.0 I have noticed higher latency in Spark Structured Streaming when reading from and writing to Kafka. I have tested both CONTINUOUS and MICROBATCH modes. In a simple read and write to Kafka using…
0
votes
0 answers

Spark Structured Streaming Delta lake schema change

We currently use Delta as our data lake, with Spark applications using its tables as sources and destinations in Spark streaming. All of this is deployed within a Kubernetes cluster, and we persist checkpoint data in Spark to handle…
0
votes
0 answers

Spark structured streaming - Stream - Static Join: How to update static DataFrame

My question is almost the same as this one: Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically. However, the solution from @Michael Heil didn't work for my code. Another similar question is: How can I update a broadcast…
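
A sketch of one refresh approach, with placeholder table paths and join key: re-reading the static side inside foreachBatch picks up changes every micro-batch and sidesteps the cached-plan problem the linked answers discuss, at the cost of one extra read per batch.

    def join_with_fresh_static(batch_df, batch_id):
        # re-load the "static" side each micro-batch so updates are visible
        static_df = spark.read.format("delta").load("/tables/dim_customers")
        (batch_df.join(static_df, "customer_id", "left")
            .write.format("delta").mode("append").save("/tables/enriched"))

    (stream_df.writeStream
        .foreachBatch(join_with_fresh_static)
        .option("checkpointLocation", "/chk/enrich")
        .start())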
0
votes
2 answers

Get dataframe rows with the latest timestamp (Spark structured streaming)

I have this dataframe: +------+-------------------+-----------+------------------------+------------------------+ |brand |original_timestamp |weight |arrival_timestamp |features …
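
A sketch of one answer pattern, using the column names from the excerpt: max over a struct orders rows by the struct's first field, so it keeps the row with the latest arrival_timestamp per brand. Note a streaming aggregation without a watermark needs update or complete output mode (or foreachBatch).

    from pyspark.sql import functions as F

    latest = (
        df.groupBy("brand")
          # struct comparison is lexicographic, so max picks the newest row
          .agg(F.max(F.struct("arrival_timestamp", "weight", "features"))
                .alias("latest"))
          .select("brand", "latest.*")
    )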
0
votes
0 answers

Handling Deduplication and Historic Data correction - Structured streaming

I'm exploring how to handle deduplication in Spark Structured Streaming when dealing with large volumes of data and with situations where a watermark cannot be used. Situation: data is published to a queue by a data provider. Spark structured…
0
votes
0 answers

Spark Structured streaming - Handling Deduplication

As per the doc https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication, we can handle duplicates using a combination of withWatermark() and dropDuplicates(). Question: using withWatermark() when we have…
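
A related sketch, assuming Spark 3.5+ and placeholder names: dropDuplicatesWithinWatermark drops duplicates whose event times fall within the watermark delay of each other, so the event-time column itself does not have to be part of the dedup key the way it does with plain dropDuplicates.

    deduped = (
        events
        .withWatermark("event_time", "10 minutes")
        # dedup on the business key alone; state expires with the watermark
        .dropDuplicatesWithinWatermark(["event_key"])
    )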
0
votes
0 answers

How to Write Streaming data to Kafka topic on Confluent Cloud?

import os from pyspark.sql import SparkSession from pyspark.sql.types import StructType, StructField, StringType, FloatType, DateType sp = SparkSession.builder.config("spark.jars", os.getcwd() +…
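
A sketch of the write side for Confluent Cloud, with placeholder bootstrap server, credentials, topic, and checkpoint path: the broker's SASL_SSL settings are passed through with the "kafka." option prefix, and the sink expects string/binary key and value columns.

    (df.selectExpr("CAST(key AS STRING)", "to_json(struct(*)) AS value")
       .writeStream
       .format("kafka")
       .option("kafka.bootstrap.servers",
               "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092")
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.mechanism", "PLAIN")
       .option("kafka.sasl.jaas.config",
               'org.apache.kafka.common.security.plain.PlainLoginModule required '
               'username="API_KEY" password="API_SECRET";')  # placeholder creds
       .option("topic", "my_topic")
       .option("checkpointLocation", "/chk/confluent")
       .start())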