Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older Spark Streaming (DStream) API from Spark 1.x.

2360 questions
13 votes, 1 answer

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from Databricks and apply it to the new Kafka connector and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark... note: the topic is written…
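For reference, the usual shape of the answer: cast the Kafka value to a string and parse it with from_json against an explicit schema. A minimal sketch, assuming a hypothetical broker, topic, and two-field JSON payload:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val payload = new StructType()
  .add("id", LongType)
  .add("name", StringType)   // hypothetical JSON fields

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
  .option("subscribe", "my-topic")                    // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")        // Kafka delivers value as binary
  .select(from_json(col("json"), payload).as("data"))
  .select("data.*")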
13 votes, 1 answer

Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?

SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream …
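The exception arises because batch operations such as cache (or show/collect) force immediate execution, which streaming Datasets do not support; a streaming plan can only run via writeStream.start(). A minimal sketch of the working pattern, with a hypothetical schema and input path:

import org.apache.spark.sql.types.StructType

val schema = new StructType().add("value", "string")  // hypothetical input schema

val input = spark.readStream
  .schema(schema)                 // streaming file sources require an explicit schema
  .csv("C:/tmp/input")

val query = input.writeStream     // cache/show would throw; start() is how a streaming plan runs
  .format("console")
  .start()

query.awaitTermination()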
12 votes, 1 answer

How to efficiently update Impala tables whose files are modified very frequently

We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files into HDFS is Spark Structured Streaming (2.3.1). Right…
asked by Victor
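One common arrangement (a sketch under assumptions, not the asker's setup): write each micro-batch to the HDFS directory backing the Impala table, then notify Impala of the new files with REFRESH over JDBC. Note that foreachBatch needs Spark 2.4+; on 2.3.x a similar effect needs a custom sink or a separate refresher process. Directory, table, and connection string below are hypothetical:

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

df.writeStream.foreachBatch { (batch: DataFrame, _: Long) =>
  // land the batch's files where the Impala table expects them
  batch.write.mode("append").parquet("hdfs:///warehouse/events")        // hypothetical dir
  // make the new files visible to Impala without a full INVALIDATE METADATA
  val conn = DriverManager.getConnection("jdbc:impala://impalad:21050") // hypothetical
  try conn.createStatement().execute("REFRESH db.events")               // hypothetical table
  finally conn.close()
}.start()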
12 votes, 3 answers

Is proper event-time sessionization possible with Spark Structured Streaming?

I've been playing around with Spark Structured Streaming and mapGroupsWithState (specifically following the StructuredSessionization example in the Spark source). I want to confirm some limitations I believe exist with mapGroupsWithState given my use case. A…
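For reference, an event-time session with mapGroupsWithState has roughly the shape below; the timeout fires only once the watermark passes the timestamp set via setTimeoutTimestamp. Case classes and the 30-second gap are hypothetical:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(sessionId: String, timestamp: java.sql.Timestamp)
case class SessionInfo(numEvents: Int)
case class SessionUpdate(id: String, numEvents: Int, expired: Boolean)

def updateSession(id: String, events: Iterator[Event],
                  state: GroupState[SessionInfo]): SessionUpdate =
  if (state.hasTimedOut) {
    // watermark passed the timeout: emit the final session and drop the state
    val result = SessionUpdate(id, state.get.numEvents, expired = true)
    state.remove()
    result
  } else {
    val evs = events.toSeq
    val count = state.getOption.map(_.numEvents).getOrElse(0) + evs.size
    state.update(SessionInfo(count))
    // close the session 30s (hypothetical gap) after the latest event seen
    state.setTimeoutTimestamp(evs.map(_.timestamp.getTime).max + 30000)
    SessionUpdate(id, count, expired = false)
  }

import spark.implicits._
val sessions = eventStream                    // eventStream: Dataset[Event], assumed
  .withWatermark("timestamp", "10 minutes")   // required for EventTimeTimeout
  .groupByKey(_.sessionId)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(updateSession)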
12 votes, 1 answer

SparkStreaming: avoid checkpointLocation check

I'm writing a library to integrate Apache Spark with a custom environment. I'm implementing both custom streaming sources and streaming writers. Some of the sources I'm developing are not recoverable, at least after a crash of the application. If an…
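For sources that can never be recovered anyway, one workaround (a sketch, not a supported switch) is to satisfy the check with a throwaway checkpoint directory; Spark still writes offsets there, but nothing meaningful is lost when it is discarded:

import java.nio.file.Files

val query = df.writeStream   // df: a streaming DataFrame, assumed
  .format("console")
  .option("checkpointLocation", Files.createTempDirectory("ephemeral-ckpt").toString)
  .start()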
12 votes, 1 answer

Structured streaming - Metrics in Grafana

I am using Structured Streaming to read data from Kafka and create various aggregate metrics. I have enabled the Graphite sink using metrics.properties. I have seen that applications on older Spark versions have streaming-related metrics. I don't see…
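Structured streaming query metrics are reported to the metrics system only when spark.sql.streaming.metricsEnabled is turned on, in addition to configuring the Graphite sink in metrics.properties. A sketch (host and prefix hypothetical):

// enable per-query streaming metrics (input rate, processing rate, latency, ...)
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")

// metrics.properties (Graphite sink), values hypothetical:
//   *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
//   *.sink.graphite.host=graphite.internal
//   *.sink.graphite.port=2003
//   *.sink.graphite.prefix=spark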
12 votes, 5 answers

Stop Structured Streaming query gracefully

I'm using Spark 2.1 and trying to stop a streaming query gracefully. Is StreamingQuery.stop() a graceful stop? I haven't seen any detailed information on this method in the documentation: void stop() Stops the execution of this query if it…
asked by shiv455
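stop() cancels the underlying streaming job, so an in-flight batch may be interrupted. A common do-it-yourself pattern (a sketch, assuming micro-batch mode and that input has drained) is to wait until the query is idle before stopping:

// wait for the current trigger to finish and for no new data to be pending,
// then stop; this approximates a graceful shutdown
while (query.status.isTriggerActive || query.status.isDataAvailable) {
  Thread.sleep(5000)
}
query.stop()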
12 votes, 1 answer

Spark Structured streaming: multiple sinks

We are consuming from Kafka using Structured Streaming and writing the processed data set to S3. We also want to write the processed data to Kafka going forward. Is it possible to do that from the same streaming query? (Spark version 2.1.1) In…
asked by user2221654
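Each sink needs its own streaming query (and its own checkpoint); the built-in Kafka sink only arrived in Spark 2.2, and on 2.4+ foreachBatch can fan one batch out to several sinks instead. A sketch with two independent queries, assuming parsed is the processed DataFrame with a string value column; paths, broker, and topic are hypothetical:

// two independent queries over the same DataFrame; each re-reads the source
val toS3 = parsed.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/out")              // hypothetical
  .option("checkpointLocation", "/ckpt/s3")
  .start()

val toKafka = parsed.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")                                 // requires Spark 2.2+
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "out-topic")                    // hypothetical
  .option("checkpointLocation", "/ckpt/kafka")
  .start()

On Spark 2.4+, a single foreachBatch that persists the batch and writes it twice avoids reading the source once per sink.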
12 votes, 1 answer

Structured Streaming exception when using append output mode with watermark

Despite the fact that I'm using withWatermark(), I'm getting the following error message when I run my spark job: Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming…
asked by Ray J
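Append mode with aggregations only works when the watermark is defined on the same event-time column used in the grouping window; otherwise Spark cannot decide when a group is final. A minimal working shape (column names hypothetical):

import org.apache.spark.sql.functions.{col, window}

val counts = events                                        // events: streaming DataFrame, assumed
  .withWatermark("eventTime", "10 minutes")                // must be the column windowed below
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()

counts.writeStream
  .outputMode("append")   // rows emitted only after the watermark passes each window's end
  .format("console")
  .start()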
11 votes, 1 answer

Spark Structured Streaming with Kafka SASL/PLAIN authentication

Is there a way of connecting a Spark Structured Streaming Job to a Kafka cluster which is secured by SASL/PLAIN authentication? I was thinking about something similar to: val df2 = spark.read.format("kafka") .option("kafka.bootstrap.servers",…
asked by user152468
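Kafka client properties can be passed through the source by prefixing them with kafka.; with a client that supports sasl.jaas.config (Kafka 0.10.2+), a sketch looks like this (broker, topic, user, and password hypothetical):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9093")      // hypothetical
  .option("subscribe", "secured-topic")                  // hypothetical
  .option("kafka.security.protocol", "SASL_PLAINTEXT")   // or SASL_SSL
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config",
    """org.apache.kafka.common.security.plain.PlainLoginModule required username="alice" password="secret";""")
  .load()

On older client jars the JAAS configuration must instead be supplied through the java.security.auth.login.config system property.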
11 votes, 2 answers

Pass additional arguments to foreachBatch in pyspark

I am using foreachBatch in pyspark structured streaming to write each microbatch to SQL Server using JDBC. I need to use the same process for several tables, and I'd like to reuse the same writer function by adding an additional argument for table…
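In Scala the extra argument can simply be captured by a curried function (in PySpark, functools.partial or a lambda plays the same role). A sketch with hypothetical connection details:

import java.util.Properties
import org.apache.spark.sql.DataFrame

val props = new Properties()
props.setProperty("user", "sa")           // hypothetical credentials
props.setProperty("password", "secret")

// currying fixes the table name; the remaining (DataFrame, Long) matches foreachBatch
def writeBatch(table: String)(batch: DataFrame, batchId: Long): Unit =
  batch.write.mode("append")
    .jdbc("jdbc:sqlserver://host;databaseName=db", table, props)  // hypothetical URL

df.writeStream.foreachBatch(writeBatch("dbo.events") _).start()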
11 votes, 1 answer

How many Kafka consumers does a streaming query use for execution?

I was surprised to see that Spark consumes the data from Kafka with only one Kafka consumer, and that this consumer runs within the driver container. I rather expected to see Spark create as many consumers as there are partitions in the topic,…
asked by tashoyan
11 votes, 1 answer

How to manually set group.id and commit kafka offsets in spark structured streaming?

I was going through the Spark Structured Streaming - Kafka integration guide here. That guide says of enable.auto.commit: "Kafka source doesn't commit any offset." So how do I manually commit offsets once my Spark application has…
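Spark tracks offsets in its own checkpoint and deliberately never commits them to Kafka, so a consumer group only matters if external tooling must see progress. One sketch is a StreamingQueryListener that mirrors each batch's end offsets back to Kafka (offset parsing and the actual commitSync call against a KafkaConsumer with the desired group.id are left out):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // endOffset is a JSON string such as {"my-topic":{"0":42}}; parse it and
    // commit via a plain KafkaConsumer (commitSync) under the desired group.id
    event.progress.sources.foreach(src => println(src.endOffset))
  }
})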
11 votes, 1 answer

Spark structured streaming - update data frame's schema on the fly

I have a simple structured streaming job which monitors a directory for CSV files and writes parquet files - no transformation in between. The job starts by building a data frame from reading CSV files using readStream(), with a schema which I get…
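For context, the shape of such a job: streaming file sources require a schema up front, commonly inferred once from a static read, and changing it mid-query generally means restarting the stream. A sketch with hypothetical paths:

// infer the schema once from existing files with a static (batch) read
val schema = spark.read.option("header", "true").csv("/data/in").schema

val stream = spark.readStream
  .option("header", "true")
  .schema(schema)                 // streaming CSV sources need an explicit schema
  .csv("/data/in")

stream.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/ckpt")
  .start()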
11 votes, 2 answers

Error: java.lang.IllegalArgumentException: Option 'basePath' must be a directory

Based on the book available at https://github.com/jaceklaskowski/spark-structured-streaming-book/blob/master/spark-structured-streaming.adoc, I'm trying to play with Spark Structured Streaming using the spark-shell, but I'm struggling to get it working. My…
asked by Kleyson Rios
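That error appears when basePath points at an individual file; it has to name the directory containing the (possibly partitioned) data. A sketch with hypothetical paths and schema:

val in = spark.readStream
  .schema(csvSchema)               // hypothetical, built from a static read
  .option("basePath", "/tmp/in")   // must be a directory, not a file
  .csv("/tmp/in/*")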