Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Python, R, Scala, and Java. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older Spark Streaming (DStream) API, which dates back to Spark 1.x.
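
For orientation, a minimal sketch of a Structured Streaming query in PySpark; the socket source, host/port, and console sink are illustrative placeholders only:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("minimal-stream").getOrCreate()

    # The source is treated as an unbounded table of rows.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Ordinary DataFrame transformations express the streaming computation.
    counts = (lines
              .select(F.explode(F.split(F.col("value"), " ")).alias("word"))
              .groupBy("word")
              .count())

    # start() launches an incremental query; results go to the console sink.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()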

2360 questions
0 votes, 0 answers

How to get datetime column data from Kafka

Hello everybody, I have a problem and don't know how to fix it. There is a Kafka source which delivers JSON with the following…
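
A hedged sketch of one common approach to this kind of question: decode the Kafka value, parse the JSON, and cast the string field to a proper timestamp. The topic name, schema, and timestamp format below are assumptions, since the excerpt is cut off:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("kafka-datetime").getOrCreate()

    # Hypothetical payload schema; adjust field names to the actual JSON.
    schema = StructType([
        StructField("event_time", StringType()),
        StructField("payload", StringType()),
    ])

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")          # hypothetical topic
          .load()
          # Kafka delivers value as binary: decode it, then parse the JSON.
          .select(F.from_json(F.col("value").cast("string"), schema).alias("j"))
          # to_timestamp yields a TimestampType column; the format is assumed.
          .select(F.to_timestamp("j.event_time", "yyyy-MM-dd HH:mm:ss").alias("event_time"),
                  "j.payload"))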
0 votes, 0 answers

Executor self-exiting due to: Unable to create executor due to URI has an authority component

We are working on a project where we have to deploy our application on a Spark cluster (based on EKS). We are using Spark-Operator to manage our Spark cluster. Application nature: my application is based on Spark's "structured streaming". It…
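
One known cause of "URI has an authority component" is a file: URI written with two slashes, so the path's first segment is parsed as an authority. If that is what is happening here, the fix is the three-slash form; df stands for the streaming DataFrame and the paths are placeholders:

    # "file://mnt/checkpoints" parses "mnt" as the URI authority and Hadoop
    # rejects it; "file:///mnt/checkpoints" has the required empty authority.
    query = (df.writeStream
             .format("parquet")
             .option("path", "file:///mnt/output/my-app")
             .option("checkpointLocation", "file:///mnt/checkpoints/my-app")
             .start())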
0 votes, 0 answers

How to add another column in a spark streaming dataframe whose value depends upon the value of another column

So I have a streaming source, say adClick, with a timestamp column named "adClick_time", and now I want to calculate the difference between this timestamp and the current time. Since it is a streaming source, as the data arrive the current time will keep…
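
A sketch under the excerpt's assumptions (a streaming adClickDF with a timestamp column adClick_time): derive the lag purely from columns, so it is recomputed for every micro-batch. Note that current_timestamp() is evaluated when each batch is planned:

    from pyspark.sql import functions as F

    # delay_seconds = wall-clock time at batch planning minus the event time.
    with_delay = (adClickDF
                  .withColumn("current_time", F.current_timestamp())
                  .withColumn("delay_seconds",
                              F.unix_timestamp("current_time")
                              - F.unix_timestamp("adClick_time")))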
0 votes, 0 answers

In a Java application, Dataframe.foreach() is not able to fetch rows when spark.master=yarn is used

I have a Java Spark application, which I run using the spark-submit command. My specific use case involves fetching offset data from Kafka using the Spark consumer into a Dataframe (Dataset) and performing transformations (Spark SQL) on each row of this…
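
The usual explanation for this symptom is that foreach() runs on the executors, so rows never reach driver-side code under yarn; local mode merely masks that. A hedged sketch of the common workaround, foreachBatch, which hands each micro-batch to the driver as a plain DataFrame (shown in Python as the equivalent of the Java pattern):

    def process_batch(batch_df, batch_id):
        # batch_df is a normal DataFrame here, so Spark SQL transformations
        # and driver-side actions such as collect() are legal.
        for row in batch_df.collect():   # acceptable for small batches only
            print(batch_id, row)

    query = (df.writeStream
             .foreachBatch(process_batch)
             .option("checkpointLocation", "/tmp/chk")  # placeholder path
             .start())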
0 votes, 1 answer

Spark Structured Streaming: Writing DataFrame as CSV fails because of a missing watermark

I am using Spark 3.4.1, PySpark 3.4.1, and Python 3.11. Using Spark Structured Streaming I want to write a DataFrame as a CSV file. logsDF is a pyspark.sql.dataframe.DataFrame with the schema: root |-- timestampLog: string…
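
The likely shape of the answer, sketched: the file sinks (including CSV) support only append mode, and an aggregated streaming DataFrame can be appended only once a watermark tells Spark when a window is final. Window length and paths below are assumptions:

    from pyspark.sql import functions as F

    agg = (logsDF
           .withColumn("ts", F.to_timestamp("timestampLog"))  # string -> timestamp
           .withWatermark("ts", "10 minutes")                 # tolerate 10 min lateness
           .groupBy(F.window("ts", "5 minutes"))
           .count())

    query = (agg.writeStream
             .outputMode("append")              # the only mode the file sink accepts
             .format("csv")
             .option("path", "/tmp/out")        # placeholder
             .option("checkpointLocation", "/tmp/chk")
             .start())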
0 votes, 1 answer

PySpark: understanding how topic offsets are committed in Kafka Structured Streaming

I am working with Spark for the first time and I read the docs here, but I am still struggling to understand what the maxOffsetsPerTrigger setting does. Specifically, what does "trigger interval" mean here? I suspect that this is a Spark setting…
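
Sketching the two settings the question asks about: the trigger interval is how often a new micro-batch starts, and maxOffsetsPerTrigger caps how many Kafka offsets that batch may consume. Broker, topic, and the numbers are placeholders, and spark is assumed to be an existing SparkSession:

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("maxOffsetsPerTrigger", 100000)   # at most 100k offsets per batch
          .load())

    query = (df.writeStream
             .format("console")
             .trigger(processingTime="30 seconds")  # the trigger interval
             .option("checkpointLocation", "/tmp/chk")
             .start())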
0 votes, 0 answers

How to calculate the average for the last "n" days in Spark Structured Streaming

I am looking at computing the average steps taken by the user for the "last 7 days". The device can push multiple records for the total steps taken along with the timestamp. The indicative schema is like below: id…
asked by Sharath Chandra
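
A hedged sketch of one standard approach: a sliding event-time window of 7 days that advances by 1 day yields a per-day "last 7 days" average. The column names and the watermark threshold are assumptions based on the excerpt:

    from pyspark.sql import functions as F

    avg_steps = (stepsDF                            # hypothetical streaming source
                 .withWatermark("ts", "1 day")      # how late records may arrive
                 .groupBy(F.window("ts", "7 days", "1 day"), "id")
                 .agg(F.avg("steps").alias("avg_steps")))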
0 votes, 0 answers

Cleaning checkpoint state files of stateful Spark Structured Streaming

I am struggling to find a solution for cleaning the checkpoint state files, whose number grows over time after I start a stateful Structured Streaming query and which end up taking a lot of disk space. By checkpoint state file I mean the delta and…
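
One knob that exists for exactly this: Spark retains state files for a configurable number of recent micro-batches. A sketch, with the value chosen arbitrarily:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("state-cleanup")
             # Default is 100; fewer retained batches means fewer delta/snapshot
             # files kept in the state store checkpoint directory.
             .config("spark.sql.streaming.minBatchesToRetain", "10")
             .getOrCreate())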
0 votes, 0 answers

Is there any example of connecting from Spark Structured Streaming to Kafka using the SASL OAUTHBEARER mechanism?

I have to connect to Spring Kafka from a Spark Structured Streaming program using the SASL_SSL/OAUTHBEARER mechanism. I will receive the token from an in-house ADFS portal. I have been trying to find an example or a detailed description of how this integration…
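
A hedged sketch of how the Kafka client settings are usually passed through (each consumer property gets a kafka. prefix); the login callback handler class that actually fetches the ADFS token is a placeholder you would have to supply:

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9093")   # placeholder
          .option("subscribe", "events")
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.sasl.mechanism", "OAUTHBEARER")
          .option("kafka.sasl.jaas.config",
                  "org.apache.kafka.common.security.oauthbearer."
                  "OAuthBearerLoginModule required;")
          # Hypothetical handler that obtains the token from the ADFS portal.
          .option("kafka.sasl.login.callback.handler.class",
                  "com.example.AdfsTokenCallbackHandler")
          .load())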
0 votes, 1 answer

Multiple aggregation in Spark Structured Streaming

I'm building a data pipeline using Spark Structured Streaming, which reads data from Kafka. Here is my source code: queries = [] plug_df = event_df.withWatermark('timestamp', '10 minutes').groupby( f.window(f.col('timestamp'), '5 minutes', '5…
asked by Anh Duc Ng
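
For context on this class of question: a single streaming query could not chain multiple aggregations before Spark 3.4, so the usual workaround is to run independent queries from the same source (or use foreachBatch). A sketch, where house_df stands in for a hypothetical second aggregation alongside the excerpt's plug_df:

    plug_q = (plug_df.writeStream
              .option("checkpointLocation", "/tmp/chk/plug")   # one per query
              .format("console").start())
    house_q = (house_df.writeStream
               .option("checkpointLocation", "/tmp/chk/house")
               .format("console").start())

    # Block until any of the running queries stops.
    spark.streams.awaitAnyTermination()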
0 votes, 0 answers

Unable to authenticate to Schema Registry with basic auth and TLS certificate

I want to read from a Kafka topic hosted on an on-prem cluster and write to a GCP bucket. The Schema Registry requires both basic auth and TLS authentication. I tried the code below but it doesn't seem to work. object KafkaConnection { val brokers =…
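
For the Schema Registry side, the Confluent client accepts basic auth and TLS settings together. The question's code is Scala, but the same configuration keys exist in the confluent-kafka Python client; a sketch with placeholder values:

    from confluent_kafka.schema_registry import SchemaRegistryClient

    sr = SchemaRegistryClient({
        "url": "https://schema-registry.example.com:8081",   # placeholder
        "basic.auth.user.info": "user:password",             # basic auth
        "ssl.ca.location": "/etc/ssl/ca.pem",                 # TLS trust
        "ssl.certificate.location": "/etc/ssl/client.pem",    # client cert
        "ssl.key.location": "/etc/ssl/client.key",
    })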
0 votes, 1 answer

I am facing "java.lang.OutOfMemoryError: GC overhead limit exceeded" while working with Spark Structured Streaming + Kafka

I am working with Spark Structured Streaming, taking around 10M records from a Kafka topic, transforming them, and saving them to MySQL. I am facing "java.lang.OutOfMemoryError: GC overhead limit exceeded" with Spark, and I want to limit the number of…
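
A hedged sketch of the usual remedy: bound each micro-batch with maxOffsetsPerTrigger so the 10M records arrive in slices, and write each slice to MySQL with plain JDBC inside foreachBatch. Connection details are placeholders:

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("maxOffsetsPerTrigger", 500000)   # ~0.5M records per batch
          .load())

    def save_to_mysql(batch_df, batch_id):
        (batch_df.write
         .format("jdbc")
         .option("url", "jdbc:mysql://db:3306/mydb")   # placeholder
         .option("dbtable", "events")
         .option("user", "spark")
         .option("password", "***")
         .mode("append")
         .save())

    query = (df.writeStream
             .foreachBatch(save_to_mysql)
             .option("checkpointLocation", "/tmp/chk")
             .start())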
0 votes, 0 answers

Spark streaming to Kafka giving java.lang.UnsatisfiedLinkError

Writing a test program to stream file data to a Kafka topic using Spark 3.4.0: object SparkTest { def main(args: Array[String]): Unit = { val schema = StructType(List(StructField("id", StringType,…
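
On Windows this particular error is commonly the missing Hadoop native layer (winutils.exe/hadoop.dll under HADOOP_HOME) rather than the streaming code itself; that is an assumption about this question, not a confirmed diagnosis. For completeness, a PySpark sketch of the file-to-Kafka write the program attempts; Kafka requires the outgoing value column to be string or binary, and fileDF is a hypothetical file-source stream:

    from pyspark.sql import functions as F

    query = (fileDF
             .select(F.to_json(F.struct("*")).alias("value"))
             .writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("topic", "filedata")                 # placeholder topic
             .option("checkpointLocation", "/tmp/chk")
             .start())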
0 votes, 1 answer

Apache Spark with watermark - processing data of different LogTypes in the same Kafka topic

I'm using Apache Spark Structured Streaming to read data from a Kafka topic and do some processing. I'm using a watermark to account for late-arriving records, and the code works fine. Here is the working (sample) code: from pyspark.sql import…
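
A hedged sketch of how such questions are usually resolved: a watermark applies to a whole streaming DataFrame, so filtering the topic into per-LogType streams lets each carry its own lateness tolerance. The logType column, ts column, and thresholds are assumptions:

    from pyspark.sql import functions as F

    type_a = (parsed_df.filter(F.col("logType") == "A")
              .withWatermark("ts", "10 minutes"))
    type_b = (parsed_df.filter(F.col("logType") == "B")
              .withWatermark("ts", "1 hour"))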
0 votes, 1 answer

Issue with Glue script reading Kafka events based on timestamp

I'm facing an issue with my Glue script that reads events from Kafka. Currently, I'm using Spark Structured Streaming and the script reads events starting from the earliest offset. However, I would like to modify it to read events based on a…
asked by Smaillns
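
The Spark 3.x Kafka source has an option aimed at exactly this: startingOffsetsByTimestamp, a JSON map of topic to per-partition epoch-millisecond timestamps, which Kafka resolves to the first offset at or after each timestamp. A sketch with placeholder topic, partitions, and timestamps:

    import json

    starting = json.dumps({"events": {"0": 1700000000000, "1": 1700000000000}})

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("startingOffsetsByTimestamp", starting)
          .load())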