Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and should not be confused with the older DStream-based Spark Streaming API from Spark 1.x.

2360 questions
7
votes
2 answers

What is the purpose of StreamingQuery.awaitTermination?

I have a Spark Structured Streaming job; it reads offsets from a Kafka topic and writes them to the Aerospike database. Currently I am in the process of making this job production-ready and implementing SparkListener. While going through the…
Himanshu Yadav
  • 13,315
  • 46
  • 162
  • 291
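A minimal sketch of why `awaitTermination` matters: a streaming query runs its micro-batches on background threads, so without a blocking call the driver's `main` method returns and the JVM can exit before anything is processed. The `rate` source is Spark's built-in test source; the master and app name here are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object AwaitTerminationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("await-termination-sketch")
      .getOrCreate()

    val query = spark.readStream
      .format("rate")       // built-in test source: one row per second
      .load()
      .writeStream
      .format("console")
      .start()

    // Blocks the current thread until the query stops (via stop(), an error,
    // or shutdown), keeping the driver alive while micro-batches execute.
    query.awaitTermination()
  }
}
```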
7
votes
1 answer

Spark watermark and windowing in Append mode

The structured streaming code below watermarks and windows data over a 24-hour interval with 15-minute slides. The code produces only an empty Batch 0 in Append mode; in Update mode the results are displayed correctly. Append mode is needed because the S3 sink works only…
dejan
  • 196
  • 2
  • 11
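For context, the empty early batches are by design: in Append mode a windowed aggregate row is emitted only after the watermark passes the window's end, so a 24-hour window produces nothing until a full day of event time (plus the lateness threshold) has elapsed. A minimal sketch, assuming a streaming `events` DataFrame with an `eventTime` timestamp column:

```scala
import org.apache.spark.sql.functions._

val windowed = events
  .withWatermark("eventTime", "10 minutes")  // tolerate 10 min of late data
  .groupBy(window(col("eventTime"), "24 hours", "15 minutes"))
  .count()

windowed.writeStream
  .outputMode("append")  // rows appear only once watermark > window end
  .format("console")
  .start()
```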
7
votes
0 answers

How to preserve event order per key in Structured Streaming Repartitioning By Key?

I want to write a Spark Structured Streaming Kafka consumer that reads data from a single-partition Kafka topic, repartitions the incoming data by "key" into 3 Spark partitions while keeping the messages ordered per key, and writes them to another Kafka…
7
votes
1 answer

KryoException: Unable to find class with spark structured streaming

1 - The Problem: I have a Spark program that makes use of Kryo, but not as part of the Spark mechanics. More specifically, I am using Spark Structured Streaming connected to Kafka. I read binary values coming from Kafka and decode them on my own. I am…
MaatDeamon
  • 9,532
  • 9
  • 60
  • 127
7
votes
2 answers

Spark Structured Streaming Window on non-timestamp column

I am getting a data stream of the form:

+--+---------+---+----+
|id|timestamp|val|xxx |
+--+---------+---+----+
|1 |12:15:25 | 50| 1  |
|2 |12:15:25 | 30| 1  |
|3 |12:15:26 | 30| 2  |
|4 |12:15:27 | 50| 2  |
|5 |12:15:27 | 30| 3  |
|6 |12:15:27 |…
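One common workaround, since `window()` accepts only time columns: derive fixed-size buckets over the numeric column with integer division and group by the bucket. A sketch assuming a streaming `df` with the `id`/`val` columns above (note a streaming aggregation without a watermark requires Complete or Update output mode):

```scala
import org.apache.spark.sql.functions._

val bucketed = df
  .withColumn("bucket", (col("id") / 3).cast("long"))  // groups of 3 consecutive ids
  .groupBy("bucket")
  .agg(avg("val").as("avg_val"))
```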
7
votes
1 answer

Spark Structured Streaming writeStream to Hive ORC Partitioned External Table

I am trying to use the Spark Structured Streaming writeStream API to write to an external partitioned Hive table.

CREATE EXTERNAL TABLE `XX`(
  `a` string,
  `b` string,
  `b` string,
  `happened` timestamp,
  `processed` timestamp,
  `d` string,
  `e` string,
  `f`…
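A minimal sketch of the usual approach: let the stream write partitioned ORC files to the table's location, then make Hive aware of new partitions (e.g. via `MSCK REPAIR TABLE`). The paths and the choice of `d` as the partition column are assumptions:

```scala
df.writeStream
  .format("orc")
  .option("path", "/warehouse/xx")                 // external table location
  .option("checkpointLocation", "/checkpoints/xx") // required for fault tolerance
  .partitionBy("d")                                // one directory per value of d
  .start()
```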
7
votes
1 answer

Watermarking for Spark structured streaming with three way joins

I have 3 streams of data: foo, bar and baz. I need to join these streams with LEFT OUTER JOIN in the following chain: foo -> bar -> baz. Here's an attempt to mimic these streams with the built-in rate source: val rateStream =…
ChernikovP
  • 471
  • 1
  • 8
  • 18
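For reference, chained stream-stream outer joins need a watermark on every input plus a time-range condition in each join so Spark can bound its state. A hedged sketch, where the column names, intervals, and lateness thresholds are all assumptions:

```scala
import org.apache.spark.sql.functions.expr

val fooW = foo.withWatermark("fooTime", "10 minutes")
val barW = bar.withWatermark("barTime", "10 minutes")
val bazW = baz.withWatermark("bazTime", "10 minutes")

val joined = fooW
  .join(barW,
    expr("fooId = barId AND barTime BETWEEN fooTime AND fooTime + interval 5 minutes"),
    "leftOuter")
  .join(bazW,
    expr("barId = bazId AND bazTime BETWEEN barTime AND barTime + interval 5 minutes"),
    "leftOuter")
```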
7
votes
0 answers

Restarting Spark structured streaming query on exception or termination

What's the right way of programmatically restarting a structured streaming query which has terminated due to an exception? Example code or reference would be appreciated. Could it be done from within the onQueryTerminated() event handler of…
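One pattern that avoids restarting from inside the listener callback: wrap the blocking `awaitTermination` call in a supervision loop on the driver, restarting the query from the same checkpoint directory on failure. The restart cap and the shape of `startQuery` are assumptions; the loop itself is plain Scala:

```scala
// Runs startQuery until it terminates cleanly or maxRestarts is exceeded.
// startQuery is expected to block, e.g. df.writeStream...start().awaitTermination()
// Returns the number of restarts that occurred.
def runForever(startQuery: () => Unit, maxRestarts: Int = 5): Int = {
  var restarts = 0
  var running = true
  while (running && restarts <= maxRestarts) {
    try {
      startQuery()
      running = false          // clean termination: do not restart
    } catch {
      case _: Exception =>
        restarts += 1          // failure: loop around and restart from checkpoint
    }
  }
  restarts
}
```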
7
votes
1 answer

Structured Streaming and Splitting nested data into multiple datasets

I'm working with Spark's Structured Streaming (2.2.1), using Kafka to receive data from sensors every 60 seconds. I'm having trouble wrapping my head around how to package this Kafka data to be able to process it correctly as it comes. I need to be…
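A common shape for this problem: parse the Kafka `value` with `from_json`, then `explode` the nested array into one row per sensor. The schema and field names below are assumptions about the payload:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("deviceId", StringType)
  .add("sensors", ArrayType(new StructType()
    .add("name", StringType)
    .add("value", DoubleType)))

val perSensor = kafkaDf
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select(col("data.deviceId"), explode(col("data.sensors")).as("sensor"))
  .select("deviceId", "sensor.name", "sensor.value")  // one row per sensor reading
```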
7
votes
1 answer

Simulate Lag Function - Spark structured streaming

I'm using Spark Structured Streaming to analyze sensor data and need to perform calculations based on a sensor's previous timestamp. My incoming data stream has three columns: sensor_id, timestamp, and temp. I need to add a fourth column that is that…
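Since `lag()` is unsupported over streams, the usual substitute is keyed state: remember the last timestamp per sensor and attach it to each new reading. The per-key step below is pure Scala (the case class and field names are assumptions); in Structured Streaming it would run inside `ds.groupByKey(_.sensorId).flatMapGroupsWithState(...)`:

```scala
case class Reading(sensorId: String, timestamp: Long, temp: Double)
case class WithPrev(sensorId: String, timestamp: Long, temp: Double,
                    prevTimestamp: Option[Long])

// Given the previously seen timestamp for this sensor (None for the first
// event) and a new reading, emit the enriched row and the updated state.
def step(state: Option[Long], r: Reading): (WithPrev, Option[Long]) =
  (WithPrev(r.sensorId, r.timestamp, r.temp, state), Some(r.timestamp))
```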
7
votes
1 answer

jsontostructs to Row in spark structured streaming

I'm using Spark 2.2 and I'm trying to read JSON messages from Kafka, transform them to a DataFrame and have them as a Row:

spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  …
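The standard Scala-side recipe: cast the Kafka `value` to a string, parse it with `from_json`, and flatten the resulting struct. The topic name and schema fields are assumptions:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val rows = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")  // one typed column per JSON field
```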
7
votes
2 answers

Structured streaming : watermark vs. exactly-once semantics

The programming guide says that structured streaming guarantees end-to-end exactly-once semantics using appropriate sources/sinks. However, I don't understand how this works when the job crashes and a watermark is applied. Below is an example…
Ray J
  • 805
  • 1
  • 9
  • 13
7
votes
1 answer

Structured Streaming - Foreach Sink

I am basically reading from a Kafka source and dumping each message through to my foreach processor (thanks to Jacek's page for the simple example). If this actually works, I shall perform some business logic in the process method here,…
Raghav
  • 2,128
  • 5
  • 27
  • 46
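A minimal `ForeachWriter` sketch for orientation: the three callbacks run on the executors, once per partition and epoch, which is why connections should be acquired in `open` rather than in the driver. The `println` stands in for real business logic:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

val writer = new ForeachWriter[Row] {
  // Return true to process this partition; acquire connections here.
  def open(partitionId: Long, version: Long): Boolean = true

  // Called once per record; put business logic here.
  def process(row: Row): Unit = println(row)

  // Release resources; errorOrNull is non-null if processing failed.
  def close(errorOrNull: Throwable): Unit = ()
}

df.writeStream.foreach(writer).start()
```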
7
votes
2 answers

How to deserialize records from Kafka using Structured Streaming in Java?

I use Spark 2.1. I am trying to read records from Kafka using Spark Structured Streaming, deserialize them and apply aggregations afterwards. I have the following code:

SparkSession spark = SparkSession
  .builder()
  …
dchar
  • 1,665
  • 2
  • 19
  • 28
6
votes
1 answer

Stream-Static Join: How to refresh (unpersist/persist) a static DataFrame periodically

I am building a Spark Structured Streaming application where I am doing a batch-stream join, and the source for the batch data is updated periodically. So I am planning to persist/unpersist that batch data periodically. Below is a sample…
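One way this is commonly done (Spark 2.4+): drive the refresh from `foreachBatch`, which runs on the driver, swapping the cached static side every N micro-batches. The paths, join key, and refresh interval are assumptions:

```scala
import org.apache.spark.sql.DataFrame

// Cached static side of the join; reloaded periodically below.
var staticDf = spark.read.parquet("/data/static").cache()

streamDf.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    if (batchId % 100 == 0) {              // refresh every 100 batches (assumption)
      staticDf.unpersist()
      staticDf = spark.read.parquet("/data/static").cache()
    }
    batch.join(staticDf, "key")
      .write.mode("append").parquet("/data/out")
  }
  .start()
```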