Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older DStream-based Spark Streaming API from Spark 1.x.

2360 questions
15
votes
1 answer

How to read records in JSON format from Kafka using Structured Streaming?

I am trying to use the structured streaming approach, based on the DataFrame/Dataset API, to load a stream of data from Kafka. I use: Spark 2.10, Kafka 0.10, spark-sql-kafka-0-10. The Spark Kafka DataSource has a defined underlying schema:…
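A minimal sketch of the usual pattern for this (broker, topic, and the JSON schema below are placeholders; from_json requires Spark 2.1+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-json").getOrCreate()
import spark.implicits._

// Hypothetical schema for the JSON payload carried in the Kafka value
val schema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder brokers
  .option("subscribe", "events")                       // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")         // value arrives as bytes
  .select(from_json($"json", schema).as("data"))       // parse against the schema
  .select("data.*")
```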
14
votes
1 answer

How to specify batch interval in Spark Structured Streaming?

I am going through Spark Structured Streaming and encountered a problem. With StreamingContext (DStreams), we can define a batch interval as follows: from pyspark.streaming import StreamingContext ssc = StreamingContext(sc, 5) # 5 second batch…
dev
  • 732
  • 2
  • 8
  • 29
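Structured Streaming has no StreamingContext-style batch interval; the closest equivalent is a processing-time trigger (Spark 2.2+ spelling shown). A self-contained sketch using the rate source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-demo").getOrCreate()

// Any streaming DataFrame works; the rate source keeps the sketch self-contained.
val df = spark.readStream.format("rate").load()

// Closest equivalent of a 5-second batch interval: a processing-time
// trigger, which starts a new micro-batch roughly every 5 seconds.
val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
```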
14
votes
4 answers

How to set group.id for consumer group in kafka data source in Structured Streaming?

I want to use Spark Structured Streaming to read from a secure Kafka cluster. This means that I will need to force a specific group.id. However, as stated in the documentation, this is not possible. Still, in the Databricks documentation…
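For reference, a hedged sketch of the version-dependent behaviour (assuming an existing SparkSession spark; broker and topic are placeholders): Spark 3.0+ accepts an explicit consumer group via kafka.group.id, while earlier releases always generate a unique group id, optionally prefixed via the groupIdPrefix option.

```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "secure-topic")              // placeholder
  .option("kafka.group.id", "my-fixed-group")       // honoured on Spark 3.0+ only
  .load()
```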
14
votes
2 answers

Spark Structured Streaming Checkpoint Cleanup

I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up, and it works correctly as far as I can tell, except I don't understand what will happen in a couple of situations. If my streaming app runs for a long time, will…
torpedoted
  • 223
  • 3
  • 6
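The knob most relevant to the retention question, sketched under the assumption of an existing SparkSession spark: Spark keeps only a bounded window of batches in the checkpoint and _spark_metadata logs, governed by spark.sql.streaming.minBatchesToRetain (default 100).

```scala
// Older batches beyond this window are cleaned up from the checkpoint
// and _spark_metadata logs as the query progresses.
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "50")
```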
14
votes
4 answers

How to create a custom streaming data source?

I have a custom reader for Spark Streaming that reads data from a WebSocket. I'm going to try Spark Structured Streaming. How do I create a streaming data source in Spark Structured Streaming?
szu
  • 932
  • 1
  • 9
  • 22
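For orientation, a skeleton of the Spark 2.x V1 streaming source API (org.apache.spark.sql.execution.streaming.Source). Note this API is internal and subject to change, and the WebSocket names and stubbed bodies below are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.StructType

// Provider registered via its fully qualified class name in .format(...)
class WebSocketSourceProvider extends StreamSourceProvider {
  override def sourceSchema(
      sqlContext: SQLContext, schema: Option[StructType],
      providerName: String, parameters: Map[String, String]): (String, StructType) =
    ("websocket", new StructType().add("value", "string"))

  override def createSource(
      sqlContext: SQLContext, metadataPath: String,
      schema: Option[StructType], providerName: String,
      parameters: Map[String, String]): Source = new WebSocketSource(sqlContext)
}

class WebSocketSource(sqlContext: SQLContext) extends Source {
  override def schema: StructType = new StructType().add("value", "string")
  override def getOffset: Option[Offset] = ??? // newest offset received so far
  override def getBatch(start: Option[Offset], end: Offset): DataFrame =
    ??? // rows between the two offsets, as a DataFrame
  override def stop(): Unit = () // close the WebSocket connection here
}
```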
14
votes
2 answers

Spark structured streaming - join static dataset with streaming dataset

I'm using Spark structured streaming to process records read from Kafka. Here's what I'm trying to achieve: (a) Each record is a Tuple2 of type (Timestamp, DeviceId). (b) I've created a static Dataset[DeviceId] which contains the set of all valid…
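A minimal sketch of a stream-static join of this shape (paths, broker, and topic are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()
import spark.implicits._

// Static side: the set of valid device ids (placeholder path)
val validDevices = spark.read.textFile("/data/valid-device-ids").toDF("deviceId")

// Streaming side: (timestamp, deviceId) records from Kafka (placeholders)
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "devices")
  .load()
  .selectExpr("timestamp", "CAST(value AS STRING) AS deviceId")

// Inner join drops streaming records whose deviceId is not in the static set
val valid = records.join(validDevices, "deviceId")
```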
14
votes
0 answers

Why does a single structured query run multiple SQL queries per batch?

Why does the following structured query run multiple SQL queries as can be seen in web UI's SQL tab? import org.apache.spark.sql.streaming.{OutputMode, Trigger} import scala.concurrent.duration._ val rates = spark. readStream. format("rate"). …
Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
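For context, a runnable rate-source query of the kind the excerpt shows (assuming an existing SparkSession spark):

```scala
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load()

// While this runs, the web UI's SQL tab shows the physical queries
// executed per micro-batch.
val query = rates.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append())
  .start()
```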
14
votes
1 answer

Apache Spark Structured Streaming vs Apache Flink: what is the difference?

We have discussed the questions below: What is the difference between Apache Spark and Apache Flink? [closed] What does “streaming” mean in Apache Spark and Apache Flink? What is the difference between mini-batch vs real time streaming in practice…
14
votes
1 answer

Monitoring Structured Streaming

I have a structured stream set up that is running just fine, but I was hoping to monitor it while it is running. I have built an EventCollector: class EventCollector extends StreamingQueryListener { override def onQueryStarted(event:…
Leyth G
  • 1,103
  • 2
  • 15
  • 38
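A minimal sketch of the StreamingQueryListener API the excerpt is building on (assuming an existing SparkSession spark; the println bodies are placeholders for real metric collection):

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class EventCollector extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Progress: ${event.progress.json}") // per-batch rates, durations, etc.
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
}

// Register the listener; it fires for every query on this session
spark.streams.addListener(new EventCollector)
```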
13
votes
3 answers

IllegalStateException: _spark_metadata/0 doesn't exist while compacting batch 9

We have a streaming application implemented using Spark Structured Streaming which tries to read data from Kafka topics and write it to an HDFS location. Sometimes the application fails with the exception: _spark_metadata/0 doesn't exist while compacting batch…
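One commonly reported cause, sketched here as an assumption rather than a diagnosis: queries sharing an output path or checkpointLocation, so the compaction file one query expects was never written. Assuming a streaming DataFrame df:

```scala
// Keep output and checkpoint paths unique per query; sharing them across
// queries is a frequently reported source of missing _spark_metadata files.
val query = df.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out/topicA")          // placeholder
  .option("checkpointLocation", "hdfs:///chk/topicA") // placeholder, unique per query
  .start()
```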
13
votes
3 answers

How to insert spark structured streaming DataFrame to Hive external table/location?

A question on Spark Structured Streaming integration with a Hive table. I have tried some examples of Spark Structured Streaming. Here is my example: val spark = SparkSession.builder().appName("StatsAnalyzer") .enableHiveSupport() …
BigD
  • 850
  • 2
  • 17
  • 40
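On Spark 2.4+, foreachBatch is the usual route to a Hive table. A hedged sketch assuming a streaming DataFrame statsDf and a hypothetical table mydb.stats (the typed writeBatch function also avoids a known foreachBatch overload ambiguity under Scala 2.12):

```scala
import org.apache.spark.sql.DataFrame

// Each micro-batch arrives as an ordinary DataFrame, so the batch writer's
// insertInto/saveAsTable can target a Hive table directly.
def writeBatch(batch: DataFrame, batchId: Long): Unit =
  batch.write.mode("append").insertInto("mydb.stats") // hypothetical table

val query = statsDf.writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "/chk/stats") // placeholder
  .start()
```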
13
votes
3 answers

Structured streaming won't write DF to file sink citing /_spark_metadata/9.compact doesn't exist

I'm building a Kafka ingest module on EMR 5.11.1 with Spark 2.2.1. My intention is to use Structured Streaming to consume from a Kafka topic, do some processing, and store to EMRFS/S3 in Parquet format. The console sink works as expected; the file sink does not…
13
votes
1 answer

Spark Structured Streaming ForeachWriter and database performance

I've had a go at implementing a structured stream like so... myDataSet .map(r => StatementWrapper.Transform(r)) .writeStream .foreach(MyWrapper.myWriter) .start() .awaitTermination() This all seems to work, but looking at the throughput of…
Exie
  • 466
  • 5
  • 16
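A sketch of the ForeachWriter lifecycle that usually explains the throughput observation: open/process/close run once per partition per epoch, so connection setup belongs in open(), not in process(). The JDBC URL and table below are placeholders:

```scala
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.ForeachWriter

class JdbcWriter extends ForeachWriter[String] {
  private var conn: Connection = _

  // Called once per partition per epoch: do expensive setup here
  override def open(partitionId: Long, epochId: Long): Boolean = {
    conn = DriverManager.getConnection("jdbc:postgresql://host/db") // placeholder URL
    true
  }

  // Called once per row: keep this as cheap as possible
  override def process(value: String): Unit = {
    val st = conn.prepareStatement("INSERT INTO t(v) VALUES (?)") // hypothetical table
    st.setString(1, value)
    st.executeUpdate()
    st.close()
  }

  override def close(errorOrNull: Throwable): Unit =
    if (conn != null) conn.close()
}
```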
13
votes
3 answers

Using Spark Structured Streaming with Trigger.Once

There is a data lake of CSV files that's updated throughout the day. I'm trying to create a Spark Structured Streaming job with the Trigger.Once feature outlined in this blog post to periodically write the new data that's been written to the CSV…
Powers
  • 18,150
  • 10
  • 103
  • 108
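A minimal Trigger.Once sketch of that pattern (the schema and S3 paths are placeholders):

```scala
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType

// Hypothetical schema for the CSV lake
val csvSchema = new StructType().add("id", "string").add("payload", "string")

// Trigger.Once drains everything new since the last checkpoint in a single
// batch and then stops; an external scheduler re-runs the job periodically.
val query = spark.readStream
  .schema(csvSchema)
  .csv("s3a://lake/incoming/") // placeholder input path
  .writeStream
  .format("parquet")
  .option("path", "s3a://lake/bronze/")            // placeholder
  .option("checkpointLocation", "s3a://lake/chk/") // placeholder
  .trigger(Trigger.Once())
  .start()
```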
13
votes
3 answers

How to write JDBC Sink for Spark Structured Streaming [SparkException: Task not serializable]?

I need a JDBC sink for my Spark Structured Streaming data frame. As far as I know, the DataFrame API lacks a writeStream-to-JDBC implementation (in both PySpark and Scala, as of Spark 2.2.0). The only suggestion I found…
Lukiz
  • 175
  • 1
  • 9
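On Spark 2.4+ (newer than the 2.2.0 in the question), foreachBatch sidesteps the serialization trap by routing the write through the ordinary batch jdbc() writer. A hedged sketch assuming a streaming DataFrame df; URL, table, and credentials are placeholders:

```scala
import java.util.Properties
import org.apache.spark.sql.DataFrame

val props = new Properties()
props.setProperty("user", "spark") // placeholder credentials

// The JDBC write happens through the batch writer, so nothing
// non-serializable is captured by the streaming sink itself.
def writeBatch(batch: DataFrame, batchId: Long): Unit =
  batch.write.mode("append")
    .jdbc("jdbc:postgresql://host/db", "events", props) // placeholder URL/table

val query = df.writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "/chk/jdbc") // placeholder
  .start()
```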