Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using the DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Scala, Java, Python, and R. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older DStream-based Spark Streaming API of Spark 1.x.
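
As a quick orientation for the tag, here is a minimal, hedged PySpark sketch of the API in question: it reads from Spark's built-in rate source and maintains a running count on the console. All names here are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

    # The built-in "rate" source generates (timestamp, value) rows for testing.
    stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Streaming DataFrames use the same transformations as batch ones.
    counts = stream_df.groupBy().count()

    # The console sink prints each micro-batch; "complete" mode re-emits the aggregate.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()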


2360 questions
0
votes
1 answer

Streaming null fields into Kafka using PySpark

When writing a DataFrame to a Kafka topic, columns with null values do not appear in the published message. df.withColumn("test", f.lit(None)) data_frame.selectExpr( "CAST(id AS STRING) AS key", "to_json(struct(metadata,payload)) AS…
Smaillns • 2,540 • 1 • 28 • 40
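
On questions like this one: since Spark 3.0, the JSON options accepted by to_json include ignoreNullFields, which defaults to true and silently drops null columns. A minimal sketch of keeping them, reusing the question's data_frame and placeholder Kafka settings:

    import pyspark.sql.functions as f

    # Turn off ignoreNullFields (default true) so nulls survive serialization.
    out = data_frame.select(
        f.col("id").cast("string").alias("key"),
        f.to_json(f.struct("metadata", "payload"),
                  {"ignoreNullFields": "false"}).alias("value"),
    )

    (out.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "my_host:9092")  # placeholder host
        .option("topic", "my_topic")                        # hypothetical topic
        .option("checkpointLocation", "/tmp/chk")           # hypothetical path
        .start())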
0
votes
0 answers

How to handle source table re-created from scratch?

I have two Delta tables, one source and one destination, and I batch-stream (using Trigger.AvailableNow()) from source to destination. When the source table is overwritten, the next run fails because the destination table does not recognize the…
pgrandjean • 676 • 1 • 9 • 19
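
A common workaround for a stream source that gets rewritten in place is Delta Lake's documented reader options for change-tolerant reads, or a restart from a fresh checkpoint pinned to an explicit version. A hedged sketch, with placeholder paths:

    # Option 1: tolerate commits that rewrite files in the source table.
    # ignoreChanges may re-deliver rows from rewritten files, so deduplicate downstream.
    src = (spark.readStream
           .format("delta")
           .option("ignoreChanges", "true")
           .load("/path/to/source"))

    # Option 2: after a full overwrite, restart with a NEW checkpoint directory
    # and pin the starting point explicitly.
    src2 = (spark.readStream
            .format("delta")
            .option("startingVersion", "latest")
            .load("/path/to/source"))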
0
votes
0 answers

Spark streaming DataFrame query is stuck

I'm trying to read data from a Kafka topic into a Spark streaming DataFrame and write it to the console. df = spark.readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", "my_host:9092") \ .option("subscribe",…
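
For reference, a complete version of that pipeline looks roughly like the sketch below; two frequent causes of an apparently stuck query are never calling awaitTermination() and leaving startingOffsets at the default "latest" with no new data arriving. Host and topic are the question's placeholders.

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "my_host:9092")
          .option("subscribe", "my_topic")
          .option("startingOffsets", "earliest")  # also pick up pre-existing messages
          .load())

    # Kafka keys/values are binary; cast them before printing.
    query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
             .writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()  # keep the driver alive so micro-batches actually run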
0
votes
0 answers

How to write an ObjectId value to MongoDB using Spark connector 10.1 and PySpark?

I'm having trouble figuring out how to write a value of type ObjectId to MongoDB using the Spark connector 10.1 with Python (PySpark). Although I haven't found much about it online, I have tried the solution in the link below, which states…
qscott86 • 303 • 3 • 11
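
One route the 10.x connector documents is writing MongoDB extended JSON and letting the connector convert it to BSON via the convertJson write option (added in 10.1). A sketch under that assumption; the column, URI, database, and collection names are hypothetical:

    import pyspark.sql.functions as f

    # Represent the ObjectId as extended JSON in a string column...
    df2 = df.withColumn(
        "ref_id",
        f.concat(f.lit('{"$oid": "'), f.col("ref_id_hex"), f.lit('"}')),
    )

    # ...and ask the connector to parse extended JSON into BSON types on write.
    (df2.write
        .format("mongodb")
        .option("connection.uri", "mongodb://localhost:27017")  # placeholder
        .option("database", "mydb")                             # hypothetical
        .option("collection", "mycoll")                         # hypothetical
        .option("convertJson", "any")
        .mode("append")
        .save())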
0
votes
1 answer

Difference between Structured Streaming and Delta Live Tables in Databricks

I'm interested in the difference between Structured Streaming and Delta Live Tables. Databricks says: For most streaming or incremental data processing or ETL tasks, Databricks recommends Delta Live Tables. Does it mean I should always stick…
0
votes
0 answers

Spark Structured Streaming with Avro deserialization: NullPointerException while trying to read the message

I'm trying to read Avro data using Spark Structured Streaming and Kafka. The code I am using is the following: package com.test.spark import com.test.spark.ConfigKafka.getAvroSchema import org.apache.avro.generic.{GenericDatumReader,…
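
A frequent cause of a NullPointerException in exactly this setup is Confluent's wire format: messages produced through Schema Registry carry a 5-byte header (magic byte plus schema id) that a plain Avro deserializer cannot parse. The question's code is Scala; below is a hedged PySpark equivalent with placeholder schema, host, and topic.

    from pyspark.sql.avro.functions import from_avro
    import pyspark.sql.functions as f

    avro_schema = """{"type": "record", "name": "Test",
                      "fields": [{"name": "id", "type": "string"}]}"""  # placeholder

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "my_host:9092")
           .option("subscribe", "my_topic")
           .load())

    # Strip the 5-byte Confluent header before deserializing; skip this step
    # if the producer wrote plain Avro without Schema Registry.
    payload = raw.withColumn("value", f.expr("substring(value, 6, length(value) - 5)"))

    decoded = payload.select(from_avro(f.col("value"), avro_schema).alias("data"))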
0
votes
0 answers

java.lang.IllegalStateException: Must not use direct buffers with InputStream API

Getting the below exception while trying to create a new DataFrame using coalesce from an existing DataFrame. Despite setting the Hadoop config option dfs.client.use.legacy.blockreader to true, I am getting the error. The first line runs fine and df…
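
One configuration point worth checking in cases like this: Hadoop client options have to reach the Hadoop Configuration itself, not just the Spark conf. A hedged sketch of the two usual ways to set them; whether this particular option cures the direct-buffer error depends on the HDFS client version in use.

    from pyspark.sql import SparkSession

    # At session build time, spark.hadoop.* keys are forwarded to the Hadoop conf.
    spark = (SparkSession.builder
             .config("spark.hadoop.dfs.client.use.legacy.blockreader", "true")
             .getOrCreate())

    # Or set it on the live Hadoop configuration of an existing session.
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "dfs.client.use.legacy.blockreader", "true")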
0
votes
1 answer

Is there a way to do a custom window (not time-based) on a Kafka stream using PySpark?

I have a Kafka stream sending me heartbeat data for a cyclist on a circuit. I need to be able to compute the AVG heartbeat for each lap he did. I tried to use session windows, but they only work on time, and in my case the lap time could be different each lap. I found…
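
If each heartbeat record can carry a lap identifier (for instance, assigned upstream when the rider crosses the start line), the "window" stops being a time window at all and becomes an ordinary grouping key. A minimal sketch under that assumption; all column names are hypothetical.

    import pyspark.sql.functions as f

    # readings: a streaming DataFrame with rider_id, lap_id, ts, heartbeat
    lap_avg = (readings
               .groupBy("rider_id", "lap_id")
               .agg(f.avg("heartbeat").alias("avg_heartbeat")))

    # Without a time window, use update mode; note that per-lap state is kept
    # until the query stops, since no watermark can expire a non-time key.
    (lap_avg.writeStream
        .outputMode("update")
        .format("console")
        .start())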
0
votes
0 answers

Kafka streaming to Spark

I want to stream Twitter data using Kafka and do sentiment analysis with Spark. The producer is working well; it can get data from the Twitter API into the Kafka topics, but I got an error in Spark as the consumer. Below is the code…
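
A very common cause of consumer-side failures here is the missing Kafka integration package (the "Failed to find data source: kafka" error), since spark-sql-kafka is not bundled with Spark. A hedged sketch of adding it at session creation; the coordinates must match your Spark and Scala versions.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("twitter-sentiment")
             .config("spark.jars.packages",
                     "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1")  # match your build
             .getOrCreate())

    tweets = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
              .option("subscribe", "tweets")                        # hypothetical topic
              .load())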
0
votes
0 answers

How does the trigger time on a Streaming Dataset work for joins?

I want to know how the trigger time for a Streaming Dataset using join operations works for simple inner joins. As far as I understand, when the query starts, if no org.apache.spark.sql.streaming.Trigger is defined, the trigger will fire as soon as…
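
For context, the default the question describes applies to joins like any other streaming operator: with no trigger configured, the next micro-batch starts as soon as the previous one finishes. A short sketch contrasting that with an explicit processing-time trigger; the joined inputs are placeholders.

    # Placeholder streaming DataFrames sharing an "id" column.
    joined = left_stream.join(right_stream, "id")  # simple inner join

    # Default: micro-batches run back to back, as soon as the previous completes.
    q1 = joined.writeStream.format("console").start()

    # Explicit trigger: a micro-batch begins at most every 30 seconds.
    q2 = (joined.writeStream
          .trigger(processingTime="30 seconds")
          .format("console")
          .start())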
0
votes
0 answers

Spark watermark API issue

For Structured Streaming, a watermark of 1 hr is set via the API. Now I am using the API below in a StreamingQueryListener: event: StreamingQueryListener.QueryProgressEvent triggerTime = Instant.parse(event.progress.timestamp) watermarkTime =…
vipin • 152 • 12
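
On the monitoring side of that question: in PySpark the same progress data is also available without a listener through query.lastProgress, a plain dict in which the current watermark appears under eventTime once one is defined. A minimal sketch:

    from datetime import datetime

    progress = query.lastProgress  # dict describing the most recent micro-batch
    if progress and "watermark" in progress.get("eventTime", {}):
        trigger_time = datetime.fromisoformat(progress["timestamp"].rstrip("Z"))
        watermark = datetime.fromisoformat(progress["eventTime"]["watermark"].rstrip("Z"))
        print("watermark lags the trigger by", trigger_time - watermark)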
0
votes
1 answer

Spark Structured Streaming: time window semantics and Available-now micro-batch

I don't need a constantly running cluster for processing my data, so I want to use, as the Spark documentation suggests, the available-now trigger: This is useful in scenarios you want to periodically spin up a cluster, process everything that is…
Dmitry B. • 9,107 • 3 • 43 • 64
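
For reference, the trigger the question refers to is enabled as below in PySpark (Spark 3.3+): it drains everything currently available, possibly across several micro-batches, and then stops. Sink format and paths here are placeholders.

    query = (stream_df.writeStream
             .format("delta")
             .option("checkpointLocation", "/tmp/chk")  # placeholder path
             .trigger(availableNow=True)                # drain the backlog, then stop
             .start("/tmp/out"))

    query.awaitTermination()  # returns once the available data is processed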
0
votes
1 answer

How to append records to delta table in foreachBatch?

I am using foreachBatch to write streaming data into multiple targets, and it's working fine for the first micro-batch execution. When it tries to run the second micro-batch, it fails with the below error. "StreamingQueryException: Query [id =…
Nikesh • 47 • 6
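
Without the full error, here is only the usual shape of a multi-target foreachBatch writer, as a hedged sketch: the batch is persisted so each sink does not recompute it, and each write uses the ordinary batch API in append mode. Paths are placeholders.

    def write_targets(batch_df, batch_id):
        batch_df.persist()  # reused by both writes below
        batch_df.write.format("delta").mode("append").save("/tmp/target_a")
        batch_df.write.format("delta").mode("append").save("/tmp/target_b")
        batch_df.unpersist()

    (stream_df.writeStream
        .foreachBatch(write_targets)
        .option("checkpointLocation", "/tmp/chk_multi")
        .start())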
0
votes
0 answers

PySpark MongoDB connector append to array

I am using mongo-spark-connector_2.12:10.1.1 and I'm trying to save a DataFrame to MongoDB. Here are my MongoDB write configuration and code: def write_mongo_batches(self, df): return …
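
For comparison, a minimal 10.x write configuration looks like the sketch below. Note that "append" mode inserts new documents; pushing elements onto an array inside an existing document is an update operation that this write path does not express directly. URI, database, and collection are placeholders.

    def write_mongo_batches(batch_df, batch_id):
        (batch_df.write
            .format("mongodb")
            .option("connection.uri", "mongodb://localhost:27017")  # placeholder
            .option("database", "mydb")                             # hypothetical
            .option("collection", "events")                         # hypothetical
            .mode("append")  # inserts documents; does not $push into arrays
            .save())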
0
votes
1 answer

Why does Spark need both a write-ahead log and checkpoints?

Why does Spark need both a write-ahead log and checkpoints? Why can't we use only checkpoints? What is the benefit of additionally using a write-ahead log? What is the difference between the data stored in the WAL and in the checkpoint?