Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older Spark Streaming (DStream) API from Spark 1.x.

2360 questions
13 votes, 1 answer

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from Databricks and apply it to the new Kafka connector and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark... note: the topic is written…
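For reference, the usual shape of the answer: cast the Kafka value to a string and parse it with from_json against an explicit schema. A minimal sketch, assuming a hypothetical broker, topic, and two-field JSON payload:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val payload = new StructType()
  .add("id", LongType)
  .add("name", StringType)   // hypothetical JSON fields

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
  .option("subscribe", "my-topic")                    // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")        // Kafka delivers value as binary
  .select(from_json(col("json"), payload).as("data"))
  .select("data.*")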
13 votes, 1 answer

Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?

SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream …
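The exception arises because batch operations such as cache (or show/collect) force immediate execution, which streaming Datasets do not support; a streaming plan can only run via writeStream.start(). A minimal sketch of the working pattern, with a hypothetical schema and input path:

import org.apache.spark.sql.types.StructType

val schema = new StructType().add("value", "string")  // hypothetical input schema

val input = spark.readStream
  .schema(schema)                 // streaming file sources require an explicit schema
  .csv("C:/tmp/input")

val query = input.writeStream     // cache/show would throw; start() is how a streaming plan runs
  .format("console")
  .start()

query.awaitTermination()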
12 votes, 1 answer

How to efficiently update Impala tables whose files are modified very frequently

We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files into HDFS is Spark Structured Streaming (2.3.1). Right…
asked by Victor
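One common arrangement (a sketch under assumptions, not the asker's setup): write each micro-batch to the HDFS directory backing the Impala table, then notify Impala of the new files with REFRESH over JDBC. Note that foreachBatch needs Spark 2.4+; on 2.3.x a similar effect needs a custom sink or a separate refresher process. Directory, table, and connection string below are hypothetical:

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

df.writeStream.foreachBatch { (batch: DataFrame, _: Long) =>
  // land the batch's files where the Impala table expects them
  batch.write.mode("append").parquet("hdfs:///warehouse/events")        // hypothetical dir
  // make the new files visible to Impala without a full INVALIDATE METADATA
  val conn = DriverManager.getConnection("jdbc:impala://impalad:21050") // hypothetical
  try conn.createStatement().execute("REFRESH db.events")               // hypothetical table
  finally conn.close()
}.start()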
12 votes, 3 answers

Is proper event-time sessionization possible with Spark Structured Streaming?

I've been playing around with Spark Structured Streaming and mapGroupsWithState (specifically following the StructuredSessionization example in the Spark source). I want to confirm some limitations I believe exist with mapGroupsWithState given my use case. A…
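For reference, an event-time session with mapGroupsWithState has roughly the shape below; the timeout fires only once the watermark passes the timestamp set via setTimeoutTimestamp. Case classes and the 30-second gap are hypothetical:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(sessionId: String, timestamp: java.sql.Timestamp)
case class SessionInfo(numEvents: Int)
case class SessionUpdate(id: String, numEvents: Int, expired: Boolean)

def updateSession(id: String, events: Iterator[Event],
                  state: GroupState[SessionInfo]): SessionUpdate =
  if (state.hasTimedOut) {
    // watermark passed the timeout: emit the final session and drop the state
    val result = SessionUpdate(id, state.get.numEvents, expired = true)
    state.remove()
    result
  } else {
    val evs = events.toSeq
    val count = state.getOption.map(_.numEvents).getOrElse(0) + evs.size
    state.update(SessionInfo(count))
    // close the session 30s (hypothetical gap) after the latest event seen
    state.setTimeoutTimestamp(evs.map(_.timestamp.getTime).max + 30000)
    SessionUpdate(id, count, expired = false)
  }

import spark.implicits._
val sessions = eventStream                    // eventStream: Dataset[Event], assumed
  .withWatermark("timestamp", "10 minutes")   // required for EventTimeTimeout
  .groupByKey(_.sessionId)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(updateSession)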
12 votes, 1 answer

SparkStreaming: avoid checkpointLocation check

I'm writing a library to integrate Apache Spark with a custom environment. I'm implementing both custom streaming sources and streaming writers. Some of the sources I'm developing are not recoverable, at least after a crash of the application. If an…
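For sources that can never be recovered anyway, one workaround (a sketch, not a supported switch) is to satisfy the check with a throwaway checkpoint directory; Spark still writes offsets there, but nothing meaningful is lost when it is discarded:

import java.nio.file.Files

val query = df.writeStream   // df: a streaming DataFrame, assumed
  .format("console")
  .option("checkpointLocation", Files.createTempDirectory("ephemeral-ckpt").toString)
  .start()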
12 votes, 1 answer

Structured streaming - Metrics in Grafana

I am using Structured Streaming to read data from Kafka and create various aggregate metrics. I have enabled the Graphite sink using metrics.properties. I have seen that applications on older Spark versions have streaming-related metrics. I don't see…
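Structured streaming query metrics are reported to the metrics system only when spark.sql.streaming.metricsEnabled is turned on, in addition to configuring the Graphite sink in metrics.properties. A sketch (host and prefix hypothetical):

// enable per-query streaming metrics (input rate, processing rate, latency, ...)
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")

// metrics.properties (Graphite sink), values hypothetical:
//   *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
//   *.sink.graphite.host=graphite.internal
//   *.sink.graphite.port=2003
//   *.sink.graphite.prefix=spark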
12 votes, 5 answers

Stop Structured Streaming query gracefully

I'm using Spark 2.1 and trying to stop a streaming query gracefully. Is StreamingQuery.stop() a graceful stop? I haven't seen any detailed information on this method in the documentation: void stop() Stops the execution of this query if it…
asked by shiv455
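stop() cancels the underlying streaming job, so an in-flight batch may be interrupted. A common do-it-yourself pattern (a sketch, assuming micro-batch mode and that input has drained) is to wait until the query is idle before stopping:

// wait for the current trigger to finish and for no new data to be pending,
// then stop; this approximates a graceful shutdown
while (query.status.isTriggerActive || query.status.isDataAvailable) {
  Thread.sleep(5000)
}
query.stop()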
12 votes, 1 answer

Spark Structured streaming: multiple sinks

We are consuming from Kafka using Structured Streaming and writing the processed data set to S3. We also want to write the processed data to Kafka going forward. Is it possible to do that from the same streaming query? (Spark version 2.1.1) In…
asked by user2221654
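Each sink needs its own streaming query (and its own checkpoint); the built-in Kafka sink only arrived in Spark 2.2, and on 2.4+ foreachBatch can fan one batch out to several sinks instead. A sketch with two independent queries, assuming parsed is the processed DataFrame with a string value column; paths, broker, and topic are hypothetical:

// two independent queries over the same DataFrame; each re-reads the source
val toS3 = parsed.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/out")              // hypothetical
  .option("checkpointLocation", "/ckpt/s3")
  .start()

val toKafka = parsed.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")                                 // requires Spark 2.2+
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "out-topic")                    // hypothetical
  .option("checkpointLocation", "/ckpt/kafka")
  .start()

On Spark 2.4+, a single foreachBatch that persists the batch and writes it twice avoids reading the source once per sink.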
12 votes, 1 answer

Structured Streaming exception when using append output mode with watermark

Despite the fact that I'm using withWatermark(), I'm getting the following error message when I run my spark job: Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming…
asked by Ray J
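Append mode with aggregations only works when the watermark is defined on the same event-time column used in the grouping window; otherwise Spark cannot decide when a group is final. A minimal working shape (column names hypothetical):

import org.apache.spark.sql.functions.{col, window}

val counts = events                                        // events: streaming DataFrame, assumed
  .withWatermark("eventTime", "10 minutes")                // must be the column windowed below
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()

counts.writeStream
  .outputMode("append")   // rows emitted only after the watermark passes each window's end
  .format("console")
  .start()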
11 votes, 1 answer

Spark Structured Streaming with Kafka SASL/PLAIN authentication

Is there a way of connecting a Spark Structured Streaming Job to a Kafka cluster which is secured by SASL/PLAIN authentication? I was thinking about something similar to: val df2 = spark.read.format("kafka") .option("kafka.bootstrap.servers",…
asked by user152468
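Kafka client properties can be passed through the source by prefixing them with kafka.; with a client that supports sasl.jaas.config (Kafka 0.10.2+), a sketch looks like this (broker, topic, user, and password hypothetical):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9093")      // hypothetical
  .option("subscribe", "secured-topic")                  // hypothetical
  .option("kafka.security.protocol", "SASL_PLAINTEXT")   // or SASL_SSL
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config",
    """org.apache.kafka.common.security.plain.PlainLoginModule required username="alice" password="secret";""")
  .load()

On older client jars the JAAS configuration must instead be supplied through the java.security.auth.login.config system property.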
11 votes, 2 answers

Pass additional arguments to foreachBatch in pyspark

I am using foreachBatch in pyspark structured streaming to write each microbatch to SQL Server using JDBC. I need to use the same process for several tables, and I'd like to reuse the same writer function by adding an additional argument for table…
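In Scala the extra argument can simply be captured by a curried function (in PySpark, functools.partial or a lambda plays the same role). A sketch with hypothetical connection details:

import java.util.Properties
import org.apache.spark.sql.DataFrame

val props = new Properties()
props.setProperty("user", "sa")           // hypothetical credentials
props.setProperty("password", "secret")

// currying fixes the table name; the remaining (DataFrame, Long) matches foreachBatch
def writeBatch(table: String)(batch: DataFrame, batchId: Long): Unit =
  batch.write.mode("append")
    .jdbc("jdbc:sqlserver://host;databaseName=db", table, props)  // hypothetical URL

df.writeStream.foreachBatch(writeBatch("dbo.events") _).start()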
11 votes, 1 answer

How many Kafka consumers does a streaming query use for execution?

I was surprised to see that Spark consumes the data from Kafka with only one Kafka consumer, and that this consumer runs within the driver container. I rather expected to see Spark create as many consumers as there are partitions in the topic,…
asked by tashoyan
11 votes, 1 answer

How to manually set group.id and commit kafka offsets in spark structured streaming?

I was going through the Spark Structured Streaming - Kafka integration guide here. That guide says of enable.auto.commit: "Kafka source doesn't commit any offset." So how do I manually commit offsets once my Spark application has…
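Spark tracks offsets in its own checkpoint and deliberately never commits them to Kafka, so a consumer group only matters if external tooling must see progress. One sketch is a StreamingQueryListener that mirrors each batch's end offsets back to Kafka (offset parsing and the actual commitSync call against a KafkaConsumer with the desired group.id are left out):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // endOffset is a JSON string such as {"my-topic":{"0":42}}; parse it and
    // commit via a plain KafkaConsumer (commitSync) under the desired group.id
    event.progress.sources.foreach(src => println(src.endOffset))
  }
})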
11 votes, 1 answer

Spark structured streaming - update data frame's schema on the fly

I have a simple structured streaming job which monitors a directory for CSV files and writes parquet files - no transformation in between. The job starts by building a data frame from reading CSV files using readStream(), with a schema which I get…
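For context, the shape of such a job: streaming file sources require a schema up front, commonly inferred once from a static read, and changing it mid-query generally means restarting the stream. A sketch with hypothetical paths:

// infer the schema once from existing files with a static (batch) read
val schema = spark.read.option("header", "true").csv("/data/in").schema

val stream = spark.readStream
  .option("header", "true")
  .schema(schema)                 // streaming CSV sources need an explicit schema
  .csv("/data/in")

stream.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/ckpt")
  .start()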
11 votes, 2 answers

Error: java.lang.IllegalArgumentException: Option 'basePath' must be a directory

Based on the book available at https://github.com/jaceklaskowski/spark-structured-streaming-book/blob/master/spark-structured-streaming.adoc, I'm trying to play with Spark Structured Streaming using the spark-shell, but I'm struggling to get it working. My…
asked by Kleyson Rios
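That error appears when basePath points at an individual file; it has to name the directory containing the (possibly partitioned) data. A sketch with hypothetical paths and schema:

val in = spark.readStream
  .schema(csvSchema)               // hypothetical, built from a static read
  .option("basePath", "/tmp/in")   // must be a directory, not a file
  .csv("/tmp/in/*")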