Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using the DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with Spark Streaming, the older DStream-based API from Spark 1.x.


2360 questions
8 votes · 1 answer

Understanding Spark Structured Streaming Parallelism

I'm a newbie in the Spark world and struggling with some concepts. How does parallelism happen when using Spark Structured Streaming sourcing from Kafka? Let's consider the following code snippet: SparkSession spark = SparkSession …
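The usual answer can be modeled in plain Python (this is an illustrative sketch, not Spark API: the function name and core count below are invented): the Kafka source creates one Spark input partition per Kafka topic partition, and those partitions are scheduled across the available executor cores.

```python
# Simplified model of Kafka-source parallelism in Structured Streaming:
# one Spark input partition per Kafka topic partition, scheduled
# round-robin over executor cores. Names here are illustrative only.
def assign_partitions(num_kafka_partitions, num_cores):
    """Map each Kafka partition to a core, round-robin (simplified)."""
    assignment = {core: [] for core in range(num_cores)}
    for partition in range(num_kafka_partitions):
        assignment[partition % num_cores].append(partition)
    return assignment

# A 6-partition topic on 3 cores: each core reads 2 partitions.
layout = assign_partitions(6, 3)
```

The practical consequence is that a topic's partition count caps the source's read parallelism, whatever the cluster size.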
8 votes · 3 answers

How to pivot streaming dataset?

I am trying to pivot a Spark streaming dataset (structured streaming) but I get an AnalysisException (excerpt below). Could someone confirm pivoting is indeed not supported in structured streams (Spark 2.0), perhaps suggest alternative…
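Pivot is indeed unsupported on streaming Datasets in that version; the common workaround is a groupBy with one conditionally aggregated column per pivot value. A plain-Python model of that reshaping (the rows and column names below are made up for illustration):

```python
from collections import defaultdict

# Toy rows: (group key, pivot column value, measure).
rows = [("a", "x", 1), ("a", "y", 2), ("b", "x", 3)]
pivot_values = ["x", "y"]   # must be known up front, as in the workaround

# One summed cell per (group, pivot value) -- the same shape a
# groupBy(...).agg(sum-when-per-value) chain computes in Spark.
table = defaultdict(lambda: {v: 0 for v in pivot_values})
for key, col, measure in rows:
    table[key][col] += measure

result = {k: dict(v) for k, v in table.items()}
```

The cost of the workaround is that the set of pivot values has to be fixed in advance instead of discovered from the data.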
8 votes · 0 answers

Combining windowing (groupBy) and mapGroupsWithState (groupByKey) in Spark Structured Streaming

Currently using Spark 2.2.0 structured streaming. Given a stream of timestamped data with watermarking, is there a way to combine (1) the groupBy operation to achieve windowing by the timestamp field and other grouping criteria with (2) the…
tmiu (331)
8 votes · 4 answers

Unresolved reference while trying to import col from pyspark.sql.functions in python 3.5

Refer to the post here: Spark structured streaming with python. I would like to import 'col' in Python 3.5: from pyspark.sql.functions import col. However, I get an error saying unresolved reference to col. I've installed the pyspark library, so just…
8 votes · 1 answer

How does Structured Streaming execute separate streaming queries (in parallel or sequentially)?

I'm writing a test application that consumes messages from Kafka topics and then pushes data into S3 and into RDBMS tables (the flow is similar to the one presented here:…
mm112 (123)
8 votes · 3 answers

Delete files after processing with Spark Structured Streaming

I am using the file source in Spark Structured Streaming and want to delete the files after I process them. I am reading in a directory filled with JSON files (1.json, 2.json, etc.) and then writing them as Parquet files. I want to remove each file…
saul.shanabrook (3,068)
8 votes · 2 answers

Spark - Reading JSON from Partitioned Folders using Firehose

Kinesis Firehose manages the persistence of files, in this case time-series JSON, into a folder hierarchy that is partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)...great. How, using Spark 2.0, can I then read these nested sub folders…
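Spark's file-based sources accept glob patterns in the load path, so one common approach is to point the reader at a wildcard covering the YYYY/MM/DD/HH layout. A small sketch of building such paths (the bucket name and prefix are made up):

```python
from datetime import datetime

def firehose_path(base, when=None):
    """Glob over every hour folder, or a single hour if `when` is given."""
    if when is None:
        return f"{base}/*/*/*/*"                 # year/month/day/hour wildcards
    return when.strftime(f"{base}/%Y/%m/%d/%H")  # one specific hour folder

all_hours = firehose_path("s3://my-bucket/events")
one_hour = firehose_path("s3://my-bucket/events", datetime(2017, 3, 5, 14))
```

Note that plain glob reads don't turn the folder names into columns; if the timestamp is needed as data it has to be carried in the records or parsed from the input file path.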
7 votes · 2 answers

How can I control the amount of files being processed for each trigger in Spark Structured Streaming using the "Trigger once" trigger?

I am trying to use Spark Structured Streaming's feature, Trigger once, to mimic a batch-like setup. However, I run into some trouble when I am running my initial batch, because I have a lot of historic data, and for this reason I am also using the…
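For context, the file source's maxFilesPerTrigger option bounds how many pending files each micro-batch picks up on the normal trigger path (with plain Trigger.Once, the rate-limit options are ignored and the whole backlog lands in one trigger, which is the crux of this question). A plain-Python model of that batching (file names are illustrative):

```python
# Model of maxFilesPerTrigger-style rate limiting: the pending backlog is
# split into bounded micro-batches, one batch per trigger.
def split_into_triggers(pending_files, max_files_per_trigger):
    """Yield micro-batches of at most max_files_per_trigger files each."""
    for i in range(0, len(pending_files), max_files_per_trigger):
        yield pending_files[i:i + max_files_per_trigger]

backlog = [f"{n}.json" for n in range(1, 8)]     # 7 historic files
batches = list(split_into_triggers(backlog, 3))  # 3 triggers: 3 + 3 + 1 files
```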
7 votes · 1 answer

How to use kafka.group.id and checkpoints in spark 3.0 structured streaming to continue to read from Kafka where it left off after restart?

Based on the introduction in Spark 3.0 (https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html), it should be possible to set "kafka.group.id" to track the offset. For our use case, I want to avoid the potential data loss if…
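The option names involved come from the Spark Kafka integration guide; the broker address, topic, and checkpoint path below are made-up examples. The key point is that offsets are restored from the checkpoint on restart, not from the consumer group, so the durable piece is the writer's checkpoint location:

```python
# Option names are from the Spark Kafka integration guide; broker address,
# topic, and checkpoint path are invented for illustration.
reader_options = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "events",
    "kafka.group.id": "my-spark-app",  # Spark 3.0+: pin the consumer group id
}

# Restart recovery comes from the checkpoint, so the writer must point at a
# stable checkpoint directory that survives application restarts.
writer_options = {
    "checkpointLocation": "/tmp/checkpoints/events",
}
```

Pinning kafka.group.id mainly helps with broker-side ACLs and monitoring; it does not replace the checkpoint as the source of truth for progress.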
7 votes · 2 answers

When is a Kafka connector preferred over a Spark streaming solution?

With Spark streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive and Kudu. But this can also be done by using Kafka connectors for these tables. My question is: in which situations should I prefer…
7 votes · 1 answer

Structured Streaming output is not showing on Jupyter Notebook

I have two notebooks. The first notebook reads tweets from Twitter using tweepy and writes them to a socket. The other notebook reads tweets from that socket using Spark Structured Streaming (Python) and writes its result to the console.…
7 votes · 1 answer

What do these metrics mean for Spark Structured Streaming?

spark.streams.addListener(new StreamingQueryListener() { ...... override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = { println("Query made progress: " + queryProgress.progress) } ...... }) When…
Machi (403)
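The progress object handed to onQueryProgress serializes to a JSON document; a few of its headline fields can be inspected like this (the field names match the ones Spark reports, but the numbers below are invented for illustration):

```python
import json

# Invented sample of a streaming query progress payload.
progress = json.loads("""
{
  "batchId": 42,
  "numInputRows": 1200,
  "inputRowsPerSecond": 400.0,
  "processedRowsPerSecond": 600.0,
  "durationMs": {"triggerExecution": 2000}
}
""")

# processedRowsPerSecond at or above inputRowsPerSecond suggests the query
# is keeping up with its input rate rather than falling behind.
keeping_up = progress["processedRowsPerSecond"] >= progress["inputRowsPerSecond"]
```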
7 votes · 3 answers

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from kafka using pyspark. I am using spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just start zookeeper, kafka and create a new topic: /usr/local/kafka/bin/zookeeper-server-start.sh…
7 votes · 1 answer

What's the purpose of OutputMode in flatMapGroupsWithState? How/where is it used?

I'm exploring KeyValueGroupedDataset.flatMapGroupsWithState for arbitrary stateful aggregation in Spark Structured Streaming. The signature of the KeyValueGroupedDataset.flatMapGroupsWithState operator is as follows: flatMapGroupsWithState[S:…
Jacek Laskowski (72,696)
7 votes · 1 answer

spark structured streaming exception: Append output mode not supported without watermark

I have performed a simple groupBy operation on year and some aggregation as below. I tried to append the result to an HDFS path as shown below, but I am getting an error saying: org.apache.spark.sql.AnalysisException: Append output mode not supported …
BigD (850)
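The reason behind that last exception: append mode can only emit a windowed aggregate once the engine knows the window will receive no more data, and a watermark is exactly that declaration. A toy model of when a window becomes emittable (timestamps invented for illustration):

```python
from datetime import datetime, timedelta

def watermark(max_event_time, delay):
    """Watermark = latest event time seen minus the allowed lateness."""
    return max_event_time - delay

def emittable(windows, wm):
    """In append mode, only windows ending at or before the watermark
    are finalized and written out."""
    return [w for w in windows if w[1] <= wm]

windows = [
    (datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 10, 5)),
    (datetime(2020, 1, 1, 10, 5), datetime(2020, 1, 1, 10, 10)),
]
# Latest event seen at 10:12 with 5 minutes of allowed lateness -> 10:07.
wm = watermark(datetime(2020, 1, 1, 10, 12), timedelta(minutes=5))
done = emittable(windows, wm)   # only the 10:00-10:05 window is final
```

Without a watermark no window is ever provably final, so append mode has nothing it can safely write, which is why Spark raises the AnalysisException instead.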