Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with Spark Streaming, the older DStream-based API from Spark 1.x.


2360 questions
9 votes, 2 answers

How to calculate aggregations on a window when sensor readings are not sent if they haven't changed since the last event?

How can I calculate aggregations on a window, from a sensor when new events are only sent if the sensor value has changed since the last event? The sensor readings are taken at fixed times, e.g. every 5 seconds, but are only forwarded if the…
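The usual approach to this question is to densify the change-only events first (repeat the last known value into each empty slot), then aggregate with an ordinary event-time window. A minimal sketch, assuming a 5-second step and column names `ts`/`value` that are not from the question; in PySpark 2.x the densification step itself typically has to happen before or outside the streaming query (e.g. via `flatMapGroupsWithState` in Scala):

```python
def forward_fill(events, start, end, step=5):
    """Densify change-only sensor events: emit one (time, value) pair per
    `step` seconds, repeating the last known value for silent slots."""
    out, last, i = [], None, 0
    for t in range(start, end, step):
        while i < len(events) and events[i][0] <= t:
            last = events[i][1]  # latest reading at or before t
            i += 1
        out.append((t, last))
    return out


def windowed_avg(df):
    """Aggregate the densified stream with an event-time window.
    pyspark is imported lazily so the sketch reads without Spark installed."""
    from pyspark.sql import functions as F
    return (df
            .withWatermark("ts", "1 minute")
            .groupBy(F.window("ts", "30 seconds"))
            .agg(F.avg("value").alias("avg_value")))
```

The pure-Python `forward_fill` only illustrates the semantics; the window sizes and watermark are placeholder choices.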
Chris Snow
9 votes, 1 answer

Structured streaming with periodically updated static dataset

Merging streaming with static datasets is a great feature of structured streaming. But on every batch the datasets will be refreshed from the data sources. Since these sources are not always that dynamic, it would be a performance gain to cache a…
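One common pattern for this is to cache the static side but re-read it every N micro-batches from inside a `foreachBatch` handler (Spark 2.4+). A sketch under assumed names (`lookup_path`, Parquet source, join key `key` are all placeholders):

```python
def should_refresh(batch_id, every_n_batches=100):
    """Pure decision helper: refresh the cached lookup every N micro-batches."""
    return batch_id % every_n_batches == 0


def make_enricher(spark, lookup_path, every_n_batches=100):
    """Returns a foreachBatch handler that re-reads the static dataset
    periodically instead of on every batch. pyspark is imported lazily."""
    state = {"lookup": None}

    def handle(batch_df, batch_id):
        if state["lookup"] is None or should_refresh(batch_id, every_n_batches):
            if state["lookup"] is not None:
                state["lookup"].unpersist()
            state["lookup"] = spark.read.parquet(lookup_path).cache()
        # Placeholder sink; any batch writer works here.
        batch_df.join(state["lookup"], "key").write.format("console").save()

    return handle
```

Without the explicit `cache()`/`unpersist()` cycle, a plain static DataFrame is either re-read on every batch or, once cached, never refreshed at all.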
Chris
9 votes, 7 answers

Why does starting a streaming query lead to "ExitCodeException exitCode=-1073741515"?

Been trying to get used to the new structured streaming but it keeps giving me the error below as soon as I start a .writeStream query. Any idea what could be causing this? The closest I could find was an ongoing Spark bug if you split checkpoint and…
Trisivieta
9 votes, 2 answers

How to write streaming Dataset to Cassandra?

So I have a Python Stream-sourced DataFrame df that has all the data I want to place into a Cassandra table with the spark-cassandra-connector. I've tried doing this in two ways: df.write \ .format("org.apache.spark.sql.cassandra") \ …
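The Cassandra data source is batch-only, so `df.write` fails on a streaming DataFrame; with Spark 2.4+ the usual route is `foreachBatch`, which hands each micro-batch to the ordinary batch writer. A sketch, with keyspace/table/checkpoint names as assumptions:

```python
def cassandra_options(keyspace, table):
    """Options expected by the spark-cassandra-connector batch writer."""
    return {"keyspace": keyspace, "table": table}


def write_stream_to_cassandra(df, keyspace, table, checkpoint):
    """Route each micro-batch through the (batch-only) Cassandra data source.
    Requires the spark-cassandra-connector package on the classpath."""
    def write_batch(batch_df, batch_id):
        (batch_df.write
         .format("org.apache.spark.sql.cassandra")
         .options(**cassandra_options(keyspace, table))
         .mode("append")
         .save())

    return (df.writeStream
            .foreachBatch(write_batch)
            .option("checkpointLocation", checkpoint)
            .start())
```

On Spark versions before 2.4, a custom `ForeachWriter` is the remaining option.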
9 votes, 3 answers

How to display a streaming DataFrame (as show fails with AnalysisException)?

So I have some data I'm streaming into a Kafka topic; I'm taking this streaming data and placing it into a DataFrame. I want to display the data inside of the DataFrame: import os from kafka import KafkaProducer from pyspark.sql import SparkSession,…
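`show()` raises AnalysisException on a streaming DataFrame because output must go through `writeStream`. Two usual workarounds, sketched with an assumed query name:

```python
def console_query(df):
    """Print every micro-batch to stdout (debugging only)."""
    return (df.writeStream
            .format("console")
            .option("truncate", "false")
            .start())


def memory_table_sql(name):
    """SQL to inspect an in-memory sink registered under `name`."""
    return "SELECT * FROM {0}".format(name)


def memory_query(df, name):
    """Accumulate batches into an in-memory table, then query it with
    spark.sql(memory_table_sql(name)).show() from the driver."""
    return (df.writeStream
            .format("memory")
            .queryName(name)
            .start())
```

The memory sink keeps all output rows on the driver, so it is only suitable for small debugging workloads.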
9 votes, 5 answers

Apache Spark (Structured Streaming) : S3 Checkpoint support

From the spark structured streaming documentation: "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query." And sure enough, setting the checkpoint to…
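At the time of this question, S3's eventual consistency made it unreliable as a checkpoint store (renames and listings could lag), which is why an HDFS or local path was commonly recommended instead; S3 later became strongly consistent, easing this. A sketch of the s3a wiring, with placeholder credentials and bucket name:

```python
def s3a_conf(access_key, secret_key):
    """Hadoop settings commonly required before pointing checkpointLocation
    at an s3a:// path (hadoop-aws must be on the classpath)."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def start_with_s3_checkpoint(df, bucket_path):
    """bucket_path e.g. 's3a://my-bucket/checkpoints/query1' (assumed name).
    The conf entries above go through spark.sparkContext._jsc.hadoopConfiguration()
    or spark.hadoop.* keys at session build time."""
    return (df.writeStream
            .format("console")
            .option("checkpointLocation", bucket_path)
            .start())
```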
Apurva
9 votes, 4 answers

Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)

I have a Spark 2.0 application that reads messages from Kafka using Spark Streaming (with spark-streaming-kafka-0-10_2.11). Structured streaming looks really cool so I wanted to try to migrate the code, but I can't figure out how to use it. In the…
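Spark 2.0.2 has no built-in Avro support for Structured Streaming; `from_avro` arrived with the external spark-avro module in Spark 2.4 (Scala/Java) and in the Python API in Spark 3.0, so on 2.0.2 a custom deserializer UDF is needed. A sketch of the newer route, with the schema as an assumed example:

```python
import json

# Assumed example schema, not from the question.
USER_SCHEMA = json.dumps({
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"}],
})


def decode_avro_stream(spark, servers, topic):
    """Spark 3.0+ Python API; requires the org.apache.spark:spark-avro
    package on the classpath. pyspark is imported lazily."""
    from pyspark.sql.avro.functions import from_avro
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", servers)
           .option("subscribe", topic)
           .load())
    return raw.select(from_avro(raw.value, USER_SCHEMA).alias("user"))
```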
Tal Joffe
8 votes, 3 answers

How to use foreach or foreachBatch in PySpark to write to database?

I want to do Spark Structured Streaming (Spark 2.4.x) from a Kafka source to a MariaDB with Python (PySpark). I want to use the streamed Spark dataframe and not the static nor Pandas dataframe. It seems that one has to use foreach or foreachBatch…
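`foreachBatch` (Spark 2.4+) is the simpler of the two: it hands each micro-batch to the ordinary batch JDBC writer, so no row-by-row `ForeachWriter` is needed. A sketch with assumed table/credential names; the MariaDB JDBC driver must be on the Spark classpath:

```python
def jdbc_url(host, port, database):
    """MariaDB JDBC URL (assumed host/port/database)."""
    return "jdbc:mariadb://{0}:{1}/{2}".format(host, port, database)


def stream_to_mariadb(df, url, table, user, password, checkpoint):
    """foreachBatch reuses the ordinary batch JDBC writer per micro-batch."""
    def write_batch(batch_df, batch_id):
        (batch_df.write
         .format("jdbc")
         .option("url", url)
         .option("dbtable", table)
         .option("user", user)
         .option("password", password)
         .mode("append")
         .save())

    return (df.writeStream
            .foreachBatch(write_batch)
            .option("checkpointLocation", checkpoint)
            .start())
```

Note that `foreachBatch` provides at-least-once delivery by default; exactly-once requires deduplicating on `batch_id` downstream.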
tardis
8 votes, 2 answers

Spark Structured Streaming with Kafka doesn't honor startingOffsets="earliest"

I've set up Spark Structured Streaming (Spark 2.3.2) to read from Kafka (2.0.0). I'm unable to consume from the beginning of the topic if messages entered the topic before the Spark streaming job was started. Is this expected behavior of Spark streaming…
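A frequent cause here: `startingOffsets` only applies on the very first start of a query. Once a checkpoint exists, the offsets stored there win, so restarting against an old checkpoint skips "earliest". A sketch, including the explicit per-partition JSON form (-2 means earliest, -1 latest); topic and server names are placeholders:

```python
import json


def earliest_offsets_json(topic, num_partitions):
    """Explicit per-partition form of startingOffsets; -2 = earliest, -1 = latest."""
    return json.dumps({topic: {str(p): -2 for p in range(num_partitions)}})


def read_from_beginning(spark, servers, topic):
    """Only honored when no checkpoint for this query exists yet."""
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", servers)
            .option("subscribe", topic)
            .option("startingOffsets", "earliest")
            .load())
```

Deleting (or pointing the query at a fresh) checkpoint directory is what forces the option to be re-evaluated.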
8 votes, 1 answer

How to use fully formed SQL with spark structured streaming

Documentation for Spark Structured Streaming says that, as of Spark 2.3, all methods on the Spark context available for static DataFrames/Datasets are also available for structured streaming DataFrames/Datasets. However I have yet…
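A streaming DataFrame can be registered as a temp view and queried with full SQL; the result of `spark.sql` is then itself a streaming DataFrame that still needs `writeStream` to produce output. A sketch with assumed view and column names:

```python
# Assumed example query; any SQL supported on streaming relations works here.
STREAMING_QUERY = """
SELECT device, COUNT(*) AS n
FROM events
GROUP BY device
"""


def run_streaming_sql(spark, stream_df):
    """Register the stream as a view and run ordinary SQL over it.
    The returned DataFrame is still streaming (isStreaming == True)."""
    stream_df.createOrReplaceTempView("events")
    return spark.sql(STREAMING_QUERY)
```

Operations unsupported on streams (e.g. arbitrary sorts without aggregation) still fail at analysis time, SQL or not.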
WestCoastProjects
8 votes, 2 answers

Amazon EMR and Spark streaming

Amazon EMR, Apache Spark 2.3, Apache Kafka, ~10 mln records per day. Apache Spark is used for processing events in 5-minute batches; once per day the worker nodes die and AWS automatically reprovisions them. On reviewing the log messages it…
8 votes, 2 answers

Outer join two Datasets (not DataFrames) in Spark Structured Streaming

I have some code that joins two streaming DataFrames and outputs to console. val dataFrame1 = df1Input.withWatermark("timestamp", "40 seconds").as("A") val dataFrame2 = df2Input.withWatermark("timestamp", "40 seconds").as("B") val finalDF:…
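Stream-stream outer joins (Spark 2.3+) need more than watermarks on both inputs: the join condition itself must bound event time on both sides, or the outer rows can never be emitted. A sketch with assumed key/timestamp column names and interval:

```python
def time_bounded_condition(left_alias, right_alias, key, seconds):
    """Equality key plus the time-range constraint required for
    stream-stream outer joins."""
    return ("{l}.{k} = {r}.{k} AND "
            "{r}.timestamp >= {l}.timestamp AND "
            "{r}.timestamp <= {l}.timestamp + interval {s} seconds").format(
                l=left_alias, r=right_alias, k=key, s=seconds)


def outer_join_streams(df1, df2):
    """Left outer join of two watermarked streams. pyspark imported lazily."""
    from pyspark.sql import functions as F
    a = df1.withWatermark("timestamp", "40 seconds").alias("A")
    b = df2.withWatermark("timestamp", "40 seconds").alias("B")
    return a.join(b, F.expr(time_bounded_condition("A", "B", "id", 30)),
                  "leftOuter")
```

The same mechanics apply to typed Datasets in Scala; `joinWith` takes the identical column-level condition.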
8 votes, 1 answer

Combining/updating Cassandra-queried data with Structured Streaming data received from Kafka

I'm creating a Spark Structured streaming application which is going to be calculating data received from Kafka every 10 seconds. To be able to do some of the calculations, I need to look up some information about sensors and placement in a…
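A stream can be enriched by joining it against a static DataFrame read from Cassandra; broadcasting the (small) lookup side keeps it on every executor. A sketch with assumed keyspace/table/column names, plus a pure-Python illustration of the join semantics:

```python
def enrich(event, sensor_info):
    """Pure illustration of the join: attach placement info to one event."""
    info = sensor_info.get(event["sensor_id"], {})
    return {**event, **info}


def enrich_stream(stream_df, spark):
    """Join a Kafka-sourced stream with a static Cassandra-backed lookup.
    Requires spark-cassandra-connector; pyspark imported lazily."""
    from pyspark.sql import functions as F
    sensors = (spark.read
               .format("org.apache.spark.sql.cassandra")
               .options(keyspace="iot", table="sensors")  # assumed names
               .load())
    return stream_df.join(F.broadcast(sensors), "sensor_id")
```

Note the static side is read once per batch unless cached; the periodic-refresh pattern from the earlier question applies if the lookup data changes.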
Martin
8 votes, 2 answers

Spark Structured Streaming fails due to checkpoint file not found

I am running Spark Structured Streaming on a test env. From time to time the job fails because some checkpoint file is not found. One reason might be that the Kafka topic has a very short retention time. But I've added…
8 votes, 1 answer

How to write streaming dataset into Hive?

Using Apache Spark 2.2: Structured Streaming, I am creating a program which reads data from Kafka and writes it to Hive. I am looking to write bulk data arriving in the Kafka topic at 100 records/sec. Hive Table Created: CREATE TABLE demo_user( timeaa
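Spark 2.2 has no streaming Hive sink; common workarounds are writing Parquet files into the table's location (with a periodic `MSCK REPAIR`/`ADD PARTITION`), or, once on Spark 2.4+, routing batches through `foreachBatch` with the batch writer's `insertInto`. A sketch of the latter; the table name comes from the question, everything else is assumed:

```python
def hive_batch_writer(table):
    """foreachBatch handler: reuse the batch writer's insertInto (Spark 2.4+).
    Requires a Hive-enabled SparkSession (enableHiveSupport())."""
    def write_batch(batch_df, batch_id):
        batch_df.write.mode("append").insertInto(table)
    return write_batch


def stream_into_hive(df, table, checkpoint):
    return (df.writeStream
            .foreachBatch(hive_batch_writer(table))
            .option("checkpointLocation", checkpoint)
            .start())
```

At ~100 records/sec, widening the trigger interval keeps the number of small files in the Hive table manageable.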