Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and should not be confused with Spark Streaming (DStreams), the older API from Spark 1.x.


2360 questions
0
votes
2 answers

Unable to use kafka jars on Jupyter notebook

I'm using Spark Structured Streaming to read data from a single-node Kafka instance, running the setup below locally on a Mac. I can read via spark-submit, but it does not work in a Jupyter notebook. from pyspark.sql import SparkSession from pyspark.sql.functions import…
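
A minimal sketch of the usual fix, assuming the notebook builds its own SparkSession: the Kafka connector must be declared via spark.jars.packages before the session is created (spark-submit does this via --packages, which is why it works there). The connector coordinates below assume Spark 3.5.0 with Scala 2.12; adjust to your installation, and the topic name is a placeholder.

    from pyspark.sql import SparkSession

    # Pull the Kafka connector from Maven at session startup; the version
    # must match the local Spark/Scala build (an assumption here).
    spark = (
        SparkSession.builder
        .appName("kafka-notebook")
        .config("spark.jars.packages",
                "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
        .getOrCreate()
    )

    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "my_topic")  # hypothetical topic name
        .load()
    )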
0
votes
1 answer

Spark continuous structured streaming not showing input rate or process rate metrics

I'm running my Spark continuous structured streaming application on a standalone cluster. However, I noticed that metrics like avg input/sec and avg process/sec are not showing (they appear as NaN) on the Structured Streaming UI. I have…
0
votes
0 answers

How to get notified when Spark streaming starts to process a file?

I use Spark to read streaming data from a folder, and I want to be notified when it picks up my file and starts processing it, so that I can log that the application is reading the file, which may take some time. I've…
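
One way to get such a notification, sketched under the assumption of PySpark 3.4+ and an existing SparkSession named spark, is a StreamingQueryListener: progress events fire per micro-batch, and numInputRows > 0 indicates the batch actually picked up input (for the file source, new files).

    from pyspark.sql.streaming import StreamingQueryListener

    class FilePickupListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            print(f"query started: {event.id}")

        def onQueryProgress(self, event):
            # numInputRows > 0 means the last micro-batch read something
            if event.progress.numInputRows > 0:
                print(f"reading input, batch {event.progress.batchId}")

        def onQueryTerminated(self, event):
            print(f"query terminated: {event.id}")

    spark.streams.addListener(FilePickupListener())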
0
votes
0 answers

Handling changes in spark streaming pipelines

What is the common/suggested practice for re-ingestion in a Spark Structured Streaming pipeline? For example, after a bug in the consumer streaming code that reads from a queue. In such cases we, as the consumer reading from the queue, would need to…
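
One common recovery pattern, sketched here for a Kafka-backed queue with placeholder topic, offsets, and paths: fix the consumer code, point the query at a new checkpoint location, and replay from explicit offsets. Note that startingOffsets is honored only when no checkpoint exists, and the sink must tolerate the resulting duplicates (e.g. via idempotent or merge writes).

    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        # hypothetical partition offsets to replay from
        .option("startingOffsets", '{"events": {"0": 120000, "1": 118500}}')
        .load()
    )

    (df.writeStream
        .format("delta")
        # a fresh checkpoint, so startingOffsets takes effect
        .option("checkpointLocation", "/chk/events_v2")
        .start("/tables/events"))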
0
votes
0 answers

spark local mode and standalone mode on single server/machine

I have a single server with 24 cores and 260 GB of memory. For speed, is it better to run Spark in local mode or in standalone cluster mode here? And what if, on the same server, I create 4 different containers and assign each 6 cores and 65 GB…
0
votes
1 answer

Infer Schema Fails in Databricks Notebook

I have written a Spark structured stream in Databricks. The first bit of code checks whether a Delta table exists for my entity; if it does not, the Delta table is created. Here, I wanted to use the infer-schema option to get the schema for the…
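
A minimal sketch of the existence check, assuming the delta-spark package and a hypothetical path and schema; note that for file-based streaming sources, schema inference additionally requires spark.sql.streaming.schemaInference=true or an explicit schema on readStream.

    from delta.tables import DeltaTable

    path = "/mnt/lake/my_entity"  # hypothetical location
    if not DeltaTable.isDeltaTable(spark, path):
        # create an empty Delta table with a placeholder schema
        spark.createDataFrame([], "id STRING, value DOUBLE") \
            .write.format("delta").save(path)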
0
votes
0 answers

Better ways to handle data corrections in Spark streaming

The Spark Structured Streaming code needs to read data from Kafka, perform a deduplication check on a key, and write to a Delta target. For dedup I am planning to use a watermark, as mentioned here:…
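
A sketch of that Kafka-to-Delta shape with watermarked dedup; the key and column names are placeholders. Including the event-time column in dropDuplicates lets the watermark bound the dedup state.

    from pyspark.sql import functions as F

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
        .select(F.col("key").cast("string").alias("event_key"),
                F.col("timestamp").alias("event_time"),
                F.col("value").cast("string").alias("value"))
    )

    deduped = (
        events
        .withWatermark("event_time", "1 hour")        # state retention bound
        .dropDuplicates(["event_key", "event_time"])  # dedup within watermark
    )

    (deduped.writeStream
        .format("delta")
        .option("checkpointLocation", "/chk/dedup")
        .start("/tables/events_deduped"))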
0
votes
0 answers

How to split a stream in structured streaming without incurring a dual read from kafka

Our Spark streaming app reads different types of events from a single global Kafka topic and needs to join two types of event streams. In Structured Streaming, we are noticing that splitting the input stream based on a filter condition…
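
One way to avoid the second Kafka scan is to read the topic once and fan out inside foreachBatch; persisting the micro-batch keeps the per-type filters (or a batch-level join) from recomputing the source read. Event-type values, paths, and the stream_df name below are placeholders.

    def fan_out(batch_df, batch_id):
        batch_df.persist()  # one Kafka read, reused by both branches
        try:
            batch_df.filter("event_type = 'A'").write \
                .format("delta").mode("append").save("/tables/type_a")
            batch_df.filter("event_type = 'B'").write \
                .format("delta").mode("append").save("/tables/type_b")
        finally:
            batch_df.unpersist()

    (stream_df.writeStream
        .foreachBatch(fan_out)
        .option("checkpointLocation", "/chk/fanout")
        .start())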
0
votes
0 answers

Spark structured streaming performance regression in latency times reading/writing to kafka since 3.0.2

During a migration from Spark 2.4.4 to Spark 3.4.0 I have noticed higher latency in Spark Structured Streaming when reading from and writing to Kafka. I have tested both CONTINUOUS and MICROBATCH modes. In a simple read and write to Kafka using…
0
votes
0 answers

Spark Structured Streaming Delta lake schema change

We currently use Delta as our data lake, with Spark applications using its tables as sources and destinations in Spark streaming. All of this is deployed within a Kubernetes cluster, and we persist checkpoint data in Spark to handle…
0
votes
0 answers

Spark structured streaming - Stream - Static Join: How to update static DataFrame

My question is almost the same as this one: Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically. However, the solution from @Michael Heil didn't work for my code. Another similar question is: How can I update a broadcast…
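
A sketch of one refresh approach, with placeholder table paths and join key: re-reading the static side inside foreachBatch picks up changes every micro-batch and sidesteps the cached-plan problem the linked answers discuss, at the cost of one extra read per batch.

    def join_with_fresh_static(batch_df, batch_id):
        # re-load the "static" side each micro-batch so updates are visible
        static_df = spark.read.format("delta").load("/tables/dim_customers")
        (batch_df.join(static_df, "customer_id", "left")
            .write.format("delta").mode("append").save("/tables/enriched"))

    (stream_df.writeStream
        .foreachBatch(join_with_fresh_static)
        .option("checkpointLocation", "/chk/enrich")
        .start())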
0
votes
2 answers

Get dataframe rows with the latest timestamp (Spark structured streaming)

I have this dataframe: +------+-------------------+-----------+------------------------+------------------------+ |brand |original_timestamp |weight |arrival_timestamp |features …
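
A sketch of one answer pattern, using the column names from the excerpt: max over a struct orders rows by the struct's first field, so it keeps the row with the latest arrival_timestamp per brand. Note a streaming aggregation without a watermark needs update or complete output mode (or foreachBatch).

    from pyspark.sql import functions as F

    latest = (
        df.groupBy("brand")
          # struct comparison is lexicographic, so max picks the newest row
          .agg(F.max(F.struct("arrival_timestamp", "weight", "features"))
                .alias("latest"))
          .select("brand", "latest.*")
    )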
0
votes
0 answers

Handling Deduplication and Historic Data correction - Structured streaming

I'm exploring how to handle deduplication in Spark Structured Streaming when dealing with large volumes of data and with situations where a watermark cannot be used. Situation: data is published to a queue by a data provider. Spark structured…
0
votes
0 answers

Spark Structured streaming - Handling Deduplication

As per the doc https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication, we can handle duplicates using a combination of withWatermark() and dropDuplicates(). Question: using withWatermark() when we have…
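
A related sketch, assuming Spark 3.5+ and placeholder names: dropDuplicatesWithinWatermark drops duplicates whose event times fall within the watermark delay of each other, so the event-time column itself does not have to be part of the dedup key the way it does with plain dropDuplicates.

    deduped = (
        events
        .withWatermark("event_time", "10 minutes")
        # dedup on the business key alone; state expires with the watermark
        .dropDuplicatesWithinWatermark(["event_key"])
    )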
0
votes
0 answers

How to Write Streaming data to Kafka topic on Confluent Cloud?

import os from pyspark.sql import SparkSession from pyspark.sql.types import StructType, StructField, StringType, FloatType, DateType sp = SparkSession.builder.config("spark.jars", os.getcwd() +…
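
A sketch of the write side for Confluent Cloud, with placeholder bootstrap server, credentials, topic, and checkpoint path: the broker's SASL_SSL settings are passed through with the "kafka." option prefix, and the sink expects string/binary key and value columns.

    (df.selectExpr("CAST(key AS STRING)", "to_json(struct(*)) AS value")
       .writeStream
       .format("kafka")
       .option("kafka.bootstrap.servers",
               "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092")
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.mechanism", "PLAIN")
       .option("kafka.sasl.jaas.config",
               'org.apache.kafka.common.security.plain.PlainLoginModule required '
               'username="API_KEY" password="API_SECRET";')  # placeholder creds
       .option("topic", "my_topic")
       .option("checkpointLocation", "/chk/confluent")
       .start())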