Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using the DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with Spark Streaming, the older DStream-based API from Spark 1.x.


2360 questions
8 votes · 1 answer

Understanding Spark Structured Streaming Parallelism

I'm a newbie in the Spark world and struggling with some concepts. How does parallelism happen when using Spark Structured Streaming sourcing from Kafka? Let's consider the following code snippet: SparkSession spark = SparkSession …
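The usual answer can be modeled in plain Python (this is an illustrative sketch, not Spark API: the function name and core count below are invented): the Kafka source creates one Spark input partition per Kafka topic partition, and those partitions are scheduled across the available executor cores.

```python
# Simplified model of Kafka-source parallelism in Structured Streaming:
# one Spark input partition per Kafka topic partition, scheduled
# round-robin over executor cores. Names here are illustrative only.
def assign_partitions(num_kafka_partitions, num_cores):
    """Map each Kafka partition to a core, round-robin (simplified)."""
    assignment = {core: [] for core in range(num_cores)}
    for partition in range(num_kafka_partitions):
        assignment[partition % num_cores].append(partition)
    return assignment

# A 6-partition topic on 3 cores: each core reads 2 partitions.
layout = assign_partitions(6, 3)
```

The practical consequence is that a topic's partition count caps the source's read parallelism, whatever the cluster size.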
8 votes · 3 answers

How to pivot streaming dataset?

I am trying to pivot a Spark streaming dataset (structured streaming) but I get an AnalysisException (excerpt below). Could someone confirm pivoting is indeed not supported in structured streams (Spark 2.0), perhaps suggest alternative…
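Pivot is indeed unsupported on streaming Datasets in that version; the common workaround is a groupBy with one conditionally aggregated column per pivot value. A plain-Python model of that reshaping (the rows and column names below are made up for illustration):

```python
from collections import defaultdict

# Toy rows: (group key, pivot column value, measure).
rows = [("a", "x", 1), ("a", "y", 2), ("b", "x", 3)]
pivot_values = ["x", "y"]   # must be known up front, as in the workaround

# One summed cell per (group, pivot value) -- the same shape a
# groupBy(...).agg(sum-when-per-value) chain computes in Spark.
table = defaultdict(lambda: {v: 0 for v in pivot_values})
for key, col, measure in rows:
    table[key][col] += measure

result = {k: dict(v) for k, v in table.items()}
```

The cost of the workaround is that the set of pivot values has to be fixed in advance instead of discovered from the data.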
8 votes · 0 answers

Combining windowing (groupBy) and mapGroupsWithState (groupByKey) in Spark Structured Streaming

Currently using Spark 2.2.0 structured streaming. Given a stream of timestamped data with watermarking, is there a way to combine (1) the groupBy operation to achieve windowing by the timestamp field and other grouping criteria with (2) the…
tmiu (331)
8 votes · 4 answers

Unresolved reference while trying to import col from pyspark.sql.functions in python 3.5

Refer to the post here: Spark structured streaming with python. I would like to import 'col' in Python 3.5: from pyspark.sql.functions import col. However, I get an error saying unresolved reference to col. I've installed the pyspark library, so just…
8 votes · 1 answer

How does Structured Streaming execute separate streaming queries (in parallel or sequentially)?

I'm writing a test application that consumes messages from Kafka topics and then pushes data into S3 and into RDBMS tables (the flow is similar to the one presented here:…
mm112 (123)
8 votes · 3 answers

Delete files after processing with Spark Structured Streaming

I am using the file source in Spark Structured Streaming and want to delete the files after I process them. I am reading in a directory filled with JSON files (1.json, 2.json, etc.) and then writing them as Parquet files. I want to remove each file…
saul.shanabrook (3,068)
8 votes · 2 answers

Spark - Reading JSON from Partitioned Folders using Firehose

Kinesis Firehose manages the persistence of files, in this case time-series JSON, into a folder hierarchy that is partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)...great. How, using Spark 2.0, can I then read these nested sub folders…
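Spark's file-based sources accept glob patterns in the load path, so one common approach is to point the reader at a wildcard covering the YYYY/MM/DD/HH layout. A small sketch of building such paths (the bucket name and prefix are made up):

```python
from datetime import datetime

def firehose_path(base, when=None):
    """Glob over every hour folder, or a single hour if `when` is given."""
    if when is None:
        return f"{base}/*/*/*/*"                 # year/month/day/hour wildcards
    return when.strftime(f"{base}/%Y/%m/%d/%H")  # one specific hour folder

all_hours = firehose_path("s3://my-bucket/events")
one_hour = firehose_path("s3://my-bucket/events", datetime(2017, 3, 5, 14))
```

Note that plain glob reads don't turn the folder names into columns; if the timestamp is needed as data it has to be carried in the records or parsed from the input file path.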
7 votes · 2 answers

How can I control the amount of files being processed for each trigger in Spark Structured Streaming using the "Trigger once" trigger?

I am trying to use Spark Structured Streaming's feature, Trigger once, to mimic a batch-like setup. However, I run into some trouble when I am running my initial batch, because I have a lot of historic data, and for this reason I am also using the…
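For context, the file source's maxFilesPerTrigger option bounds how many pending files each micro-batch picks up on the normal trigger path (with plain Trigger.Once, the rate-limit options are ignored and the whole backlog lands in one trigger, which is the crux of this question). A plain-Python model of that batching (file names are illustrative):

```python
# Model of maxFilesPerTrigger-style rate limiting: the pending backlog is
# split into bounded micro-batches, one batch per trigger.
def split_into_triggers(pending_files, max_files_per_trigger):
    """Yield micro-batches of at most max_files_per_trigger files each."""
    for i in range(0, len(pending_files), max_files_per_trigger):
        yield pending_files[i:i + max_files_per_trigger]

backlog = [f"{n}.json" for n in range(1, 8)]     # 7 historic files
batches = list(split_into_triggers(backlog, 3))  # 3 triggers: 3 + 3 + 1 files
```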
7 votes · 1 answer

How to use kafka.group.id and checkpoints in spark 3.0 structured streaming to continue to read from Kafka where it left off after restart?

Based on the introduction in Spark 3.0 (https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html), it should be possible to set "kafka.group.id" to track the offset. For our use case, I want to avoid the potential data loss if…
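The option names involved come from the Spark Kafka integration guide; the broker address, topic, and checkpoint path below are made-up examples. The key point is that offsets are restored from the checkpoint on restart, not from the consumer group, so the durable piece is the writer's checkpoint location:

```python
# Option names are from the Spark Kafka integration guide; broker address,
# topic, and checkpoint path are invented for illustration.
reader_options = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "events",
    "kafka.group.id": "my-spark-app",  # Spark 3.0+: pin the consumer group id
}

# Restart recovery comes from the checkpoint, so the writer must point at a
# stable checkpoint directory that survives application restarts.
writer_options = {
    "checkpointLocation": "/tmp/checkpoints/events",
}
```

Pinning kafka.group.id mainly helps with broker-side ACLs and monitoring; it does not replace the checkpoint as the source of truth for progress.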
7 votes · 2 answers

When is a Kafka connector preferred over a Spark streaming solution?

With Spark streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive and Kudu. But this can also be done by using Kafka connectors for these tables. My question is: in which situations should I prefer…
7 votes · 1 answer

Structured Streaming output is not showing on Jupyter Notebook

I have two notebooks. The first notebook reads tweets from Twitter using tweepy and writes them to a socket. The other notebook reads tweets from that socket using Spark Structured Streaming (Python) and writes its result to the console.…
7 votes · 1 answer

What do these metrics mean for Spark Structured Streaming?

spark.streams.addListener(new StreamingQueryListener() { ...... override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = { println("Query made progress: " + queryProgress.progress) } ...... }) When…
Machi (403)
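The progress object handed to onQueryProgress serializes to a JSON document; a few of its headline fields can be inspected like this (the field names match the ones Spark reports, but the numbers below are invented for illustration):

```python
import json

# Invented sample of a streaming query progress payload.
progress = json.loads("""
{
  "batchId": 42,
  "numInputRows": 1200,
  "inputRowsPerSecond": 400.0,
  "processedRowsPerSecond": 600.0,
  "durationMs": {"triggerExecution": 2000}
}
""")

# processedRowsPerSecond at or above inputRowsPerSecond suggests the query
# is keeping up with its input rate rather than falling behind.
keeping_up = progress["processedRowsPerSecond"] >= progress["inputRowsPerSecond"]
```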
7 votes · 3 answers

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from kafka using pyspark. I am using spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just start zookeeper, kafka and create a new topic: /usr/local/kafka/bin/zookeeper-server-start.sh…
7 votes · 1 answer

What's the purpose of OutputMode in flatMapGroupsWithState? How/where is it used?

I'm exploring KeyValueGroupedDataset.flatMapGroupsWithState for arbitrary stateful aggregation in Spark Structured Streaming. The signature of the KeyValueGroupedDataset.flatMapGroupsWithState operator is as follows: flatMapGroupsWithState[S:…
Jacek Laskowski (72,696)
7 votes · 1 answer

spark structured streaming exception: Append output mode not supported without watermark

I have performed a simple groupBy operation on year and some aggregation as below. I tried to append the result to an HDFS path as shown below, but I am getting an error saying: org.apache.spark.sql.AnalysisException: Append output mode not supported …
BigD (850)
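The reason behind that last exception: append mode can only emit a windowed aggregate once the engine knows the window will receive no more data, and a watermark is exactly that declaration. A toy model of when a window becomes emittable (timestamps invented for illustration):

```python
from datetime import datetime, timedelta

def watermark(max_event_time, delay):
    """Watermark = latest event time seen minus the allowed lateness."""
    return max_event_time - delay

def emittable(windows, wm):
    """In append mode, only windows ending at or before the watermark
    are finalized and written out."""
    return [w for w in windows if w[1] <= wm]

windows = [
    (datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 10, 5)),
    (datetime(2020, 1, 1, 10, 5), datetime(2020, 1, 1, 10, 10)),
]
# Latest event seen at 10:12 with 5 minutes of allowed lateness -> 10:07.
wm = watermark(datetime(2020, 1, 1, 10, 12), timedelta(minutes=5))
done = emittable(windows, wm)   # only the 10:00-10:05 window is final
```

Without a watermark no window is ever provably final, so append mode has nothing it can safely write, which is why Spark raises the AnalysisException instead.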