Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older DStream-based Spark Streaming API from Spark 1.x.

2360 questions
15
votes
1 answer

How to read records in JSON format from Kafka using Structured Streaming?

I am trying to use the structured streaming approach, based on the DataFrame/Dataset API, to load a stream of data from Kafka. I use: Spark 2.10, Kafka 0.10, spark-sql-kafka-0-10. The Spark Kafka DataSource has a defined underlying schema:…
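A minimal sketch of the usual pattern for this (broker, topic, and the JSON schema below are placeholders; from_json requires Spark 2.1+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-json").getOrCreate()
import spark.implicits._

// Hypothetical schema for the JSON payload carried in the Kafka value
val schema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder brokers
  .option("subscribe", "events")                       // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")         // value arrives as bytes
  .select(from_json($"json", schema).as("data"))       // parse against the schema
  .select("data.*")
```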
14
votes
1 answer

How to specify batch interval in Spark Structured Streaming?

I am going through Spark Structured Streaming and encountered a problem. With StreamingContext (DStreams), we can define a batch interval as follows: from pyspark.streaming import StreamingContext ssc = StreamingContext(sc, 5) # 5 second batch…
dev
  • 732
  • 2
  • 8
  • 29
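Structured Streaming has no StreamingContext-style batch interval; the closest equivalent is a processing-time trigger (Spark 2.2+ spelling shown). A self-contained sketch using the rate source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-demo").getOrCreate()

// Any streaming DataFrame works; the rate source keeps the sketch self-contained.
val df = spark.readStream.format("rate").load()

// Closest equivalent of a 5-second batch interval: a processing-time
// trigger, which starts a new micro-batch roughly every 5 seconds.
val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
```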
14
votes
4 answers

How to set group.id for consumer group in kafka data source in Structured Streaming?

I want to use Spark Structured Streaming to read from a secure Kafka cluster. This means that I will need to force a specific group.id. However, as stated in the documentation, this is not possible. Still, in the Databricks documentation…
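For reference, a hedged sketch of the version-dependent behaviour (assuming an existing SparkSession spark; broker and topic are placeholders): Spark 3.0+ accepts an explicit consumer group via kafka.group.id, while earlier releases always generate a unique group id, optionally prefixed via the groupIdPrefix option.

```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "secure-topic")              // placeholder
  .option("kafka.group.id", "my-fixed-group")       // honoured on Spark 3.0+ only
  .load()
```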
14
votes
2 answers

Spark Structured Streaming Checkpoint Cleanup

I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up, and it works correctly as far as I can tell, except I don't understand what will happen in a couple of situations. If my streaming app runs for a long time, will…
torpedoted
  • 223
  • 3
  • 6
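The knob most relevant to the retention question, sketched under the assumption of an existing SparkSession spark: Spark keeps only a bounded window of batches in the checkpoint and _spark_metadata logs, governed by spark.sql.streaming.minBatchesToRetain (default 100).

```scala
// Older batches beyond this window are cleaned up from the checkpoint
// and _spark_metadata logs as the query progresses.
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "50")
```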
14
votes
4 answers

How to create a custom streaming data source?

I have a custom reader for Spark Streaming that reads data from a WebSocket. I'm going to try Spark Structured Streaming. How do I create a streaming data source in Spark Structured Streaming?
szu
  • 932
  • 1
  • 9
  • 22
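For orientation, a skeleton of the Spark 2.x V1 streaming source API (org.apache.spark.sql.execution.streaming.Source). Note this API is internal and subject to change, and the WebSocket names and stubbed bodies below are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.StructType

// Provider registered via its fully qualified class name in .format(...)
class WebSocketSourceProvider extends StreamSourceProvider {
  override def sourceSchema(
      sqlContext: SQLContext, schema: Option[StructType],
      providerName: String, parameters: Map[String, String]): (String, StructType) =
    ("websocket", new StructType().add("value", "string"))

  override def createSource(
      sqlContext: SQLContext, metadataPath: String,
      schema: Option[StructType], providerName: String,
      parameters: Map[String, String]): Source = new WebSocketSource(sqlContext)
}

class WebSocketSource(sqlContext: SQLContext) extends Source {
  override def schema: StructType = new StructType().add("value", "string")
  override def getOffset: Option[Offset] = ??? // newest offset received so far
  override def getBatch(start: Option[Offset], end: Offset): DataFrame =
    ??? // rows between the two offsets, as a DataFrame
  override def stop(): Unit = () // close the WebSocket connection here
}
```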
14
votes
2 answers

Spark structured streaming - join static dataset with streaming dataset

I'm using Spark structured streaming to process records read from Kafka. Here's what I'm trying to achieve: (a) Each record is a Tuple2 of type (Timestamp, DeviceId). (b) I've created a static Dataset[DeviceId] which contains the set of all valid…
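A minimal sketch of a stream-static join of this shape (paths, broker, and topic are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()
import spark.implicits._

// Static side: the set of valid device ids (placeholder path)
val validDevices = spark.read.textFile("/data/valid-device-ids").toDF("deviceId")

// Streaming side: (timestamp, deviceId) records from Kafka (placeholders)
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "devices")
  .load()
  .selectExpr("timestamp", "CAST(value AS STRING) AS deviceId")

// Inner join drops streaming records whose deviceId is not in the static set
val valid = records.join(validDevices, "deviceId")
```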
14
votes
0 answers

Why does a single structured query run multiple SQL queries per batch?

Why does the following structured query run multiple SQL queries as can be seen in web UI's SQL tab? import org.apache.spark.sql.streaming.{OutputMode, Trigger} import scala.concurrent.duration._ val rates = spark. readStream. format("rate"). …
Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
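For context, a runnable rate-source query of the kind the excerpt shows (assuming an existing SparkSession spark):

```scala
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load()

// While this runs, the web UI's SQL tab shows the physical queries
// executed per micro-batch.
val query = rates.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append())
  .start()
```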
14
votes
1 answer

Apache Spark Structured Streaming vs Apache Flink: what is the difference?

We have discussed the questions below: What is the difference between Apache Spark and Apache Flink? [closed] What does “streaming” mean in Apache Spark and Apache Flink? What is the difference between mini-batch vs real time streaming in practice…
14
votes
1 answer

Monitoring Structured Streaming

I have a structured stream set up that is running just fine, but I was hoping to monitor it while it is running. I have built an EventCollector: class EventCollector extends StreamingQueryListener { override def onQueryStarted(event:…
Leyth G
  • 1,103
  • 2
  • 15
  • 38
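A minimal sketch of the StreamingQueryListener API the excerpt is building on (assuming an existing SparkSession spark; the println bodies are placeholders for real metric collection):

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class EventCollector extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Progress: ${event.progress.json}") // per-batch rates, durations, etc.
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
}

// Register the listener; it fires for every query on this session
spark.streams.addListener(new EventCollector)
```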
13
votes
3 answers

IllegalStateException: _spark_metadata/0 doesn't exist while compacting batch 9

We have a streaming application implemented using Spark Structured Streaming which tries to read data from Kafka topics and write it to an HDFS location. Sometimes the application fails with the exception: _spark_metadata/0 doesn't exist while compacting batch…
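One commonly reported cause, sketched here as an assumption rather than a diagnosis: queries sharing an output path or checkpointLocation, so the compaction file one query expects was never written. Assuming a streaming DataFrame df:

```scala
// Keep output and checkpoint paths unique per query; sharing them across
// queries is a frequently reported source of missing _spark_metadata files.
val query = df.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out/topicA")          // placeholder
  .option("checkpointLocation", "hdfs:///chk/topicA") // placeholder, unique per query
  .start()
```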
13
votes
3 answers

How to insert spark structured streaming DataFrame to Hive external table/location?

A question on Spark Structured Streaming integration with a Hive table. I have tried some examples of Spark Structured Streaming. Here is my example: val spark = SparkSession.builder().appName("StatsAnalyzer") .enableHiveSupport() …
BigD
  • 850
  • 2
  • 17
  • 40
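On Spark 2.4+, foreachBatch is the usual route to a Hive table. A hedged sketch assuming a streaming DataFrame statsDf and a hypothetical table mydb.stats (the typed writeBatch function also avoids a known foreachBatch overload ambiguity under Scala 2.12):

```scala
import org.apache.spark.sql.DataFrame

// Each micro-batch arrives as an ordinary DataFrame, so the batch writer's
// insertInto/saveAsTable can target a Hive table directly.
def writeBatch(batch: DataFrame, batchId: Long): Unit =
  batch.write.mode("append").insertInto("mydb.stats") // hypothetical table

val query = statsDf.writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "/chk/stats") // placeholder
  .start()
```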
13
votes
3 answers

Structured streaming won't write DF to file sink citing /_spark_metadata/9.compact doesn't exist

I'm building a Kafka ingest module on EMR 5.11.1 with Spark 2.2.1. My intention is to use Structured Streaming to consume from a Kafka topic, do some processing, and store to EMRFS/S3 in Parquet format. The console sink works as expected; the file sink does not…
13
votes
1 answer

Spark Structured Streaming ForeachWriter and database performance

I've had a go at implementing a structured stream like so... myDataSet .map(r => StatementWrapper.Transform(r)) .writeStream .foreach(MyWrapper.myWriter) .start() .awaitTermination() This all seems to work, but looking at the throughput of…
Exie
  • 466
  • 5
  • 16
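A sketch of the ForeachWriter lifecycle that usually explains the throughput observation: open/process/close run once per partition per epoch, so connection setup belongs in open(), not in process(). The JDBC URL and table below are placeholders:

```scala
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.ForeachWriter

class JdbcWriter extends ForeachWriter[String] {
  private var conn: Connection = _

  // Called once per partition per epoch: do expensive setup here
  override def open(partitionId: Long, epochId: Long): Boolean = {
    conn = DriverManager.getConnection("jdbc:postgresql://host/db") // placeholder URL
    true
  }

  // Called once per row: keep this as cheap as possible
  override def process(value: String): Unit = {
    val st = conn.prepareStatement("INSERT INTO t(v) VALUES (?)") // hypothetical table
    st.setString(1, value)
    st.executeUpdate()
    st.close()
  }

  override def close(errorOrNull: Throwable): Unit =
    if (conn != null) conn.close()
}
```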
13
votes
3 answers

Using Spark Structured Streaming with Trigger.Once

There is a data lake of CSV files that's updated throughout the day. I'm trying to create a Spark Structured Streaming job with the Trigger.Once feature outlined in this blog post to periodically write the new data that's been written to the CSV…
Powers
  • 18,150
  • 10
  • 103
  • 108
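A minimal Trigger.Once sketch of that pattern (the schema and S3 paths are placeholders):

```scala
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType

// Hypothetical schema for the CSV lake
val csvSchema = new StructType().add("id", "string").add("payload", "string")

// Trigger.Once drains everything new since the last checkpoint in a single
// batch and then stops; an external scheduler re-runs the job periodically.
val query = spark.readStream
  .schema(csvSchema)
  .csv("s3a://lake/incoming/") // placeholder input path
  .writeStream
  .format("parquet")
  .option("path", "s3a://lake/bronze/")            // placeholder
  .option("checkpointLocation", "s3a://lake/chk/") // placeholder
  .trigger(Trigger.Once())
  .start()
```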
13
votes
3 answers

How to write JDBC Sink for Spark Structured Streaming [SparkException: Task not serializable]?

I need a JDBC sink for my Spark Structured Streaming data frame. As far as I know, the DataFrame API lacks a writeStream-to-JDBC implementation (in both PySpark and Scala, as of Spark 2.2.0). The only suggestion I found…
Lukiz
  • 175
  • 1
  • 9
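On Spark 2.4+ (newer than the 2.2.0 in the question), foreachBatch sidesteps the serialization trap by routing the write through the ordinary batch jdbc() writer. A hedged sketch assuming a streaming DataFrame df; URL, table, and credentials are placeholders:

```scala
import java.util.Properties
import org.apache.spark.sql.DataFrame

val props = new Properties()
props.setProperty("user", "spark") // placeholder credentials

// The JDBC write happens through the batch writer, so nothing
// non-serializable is captured by the streaming sink itself.
def writeBatch(batch: DataFrame, batchId: Long): Unit =
  batch.write.mode("append")
    .jdbc("jdbc:postgresql://host/db", "events", props) // placeholder URL/table

val query = df.writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "/chk/jdbc") // placeholder
  .start()
```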