Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Python, R, Scala, and Java. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older Spark Streaming (DStream) API, which dates back to Spark 1.x.
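
For orientation, a minimal sketch of a Structured Streaming query in PySpark; the socket source, host/port, and console sink are illustrative placeholders only:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("minimal-stream").getOrCreate()

    # The source is treated as an unbounded table of rows.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Ordinary DataFrame transformations express the streaming computation.
    counts = (lines
              .select(F.explode(F.split(F.col("value"), " ")).alias("word"))
              .groupBy("word")
              .count())

    # start() launches an incremental query; results go to the console sink.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()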

2360 questions
0 votes, 0 answers

How to get datetime column data from Kafka

Hello everybody, I have a problem and don't know how to fix it. There is a Kafka source which delivers JSON with the following…
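
A hedged sketch of one common approach to this kind of question: decode the Kafka value, parse the JSON, and cast the string field to a proper timestamp. The topic name, schema, and timestamp format below are assumptions, since the excerpt is cut off:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("kafka-datetime").getOrCreate()

    # Hypothetical payload schema; adjust field names to the actual JSON.
    schema = StructType([
        StructField("event_time", StringType()),
        StructField("payload", StringType()),
    ])

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")          # hypothetical topic
          .load()
          # Kafka delivers value as binary: decode it, then parse the JSON.
          .select(F.from_json(F.col("value").cast("string"), schema).alias("j"))
          # to_timestamp yields a TimestampType column; the format is assumed.
          .select(F.to_timestamp("j.event_time", "yyyy-MM-dd HH:mm:ss").alias("event_time"),
                  "j.payload"))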
0 votes, 0 answers

Executor self-exiting due to: Unable to create executor due to URI has an authority component

We are working on a project where we have to deploy our application on a Spark cluster (based on EKS). We are using Spark-Operator to manage our Spark cluster. Application nature: my application is based on Spark's "structured streaming". It…
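
One known cause of "URI has an authority component" is a file: URI written with two slashes, so the path's first segment is parsed as an authority. If that is what is happening here, the fix is the three-slash form; df stands for the streaming DataFrame and the paths are placeholders:

    # "file://mnt/checkpoints" parses "mnt" as the URI authority and Hadoop
    # rejects it; "file:///mnt/checkpoints" has the required empty authority.
    query = (df.writeStream
             .format("parquet")
             .option("path", "file:///mnt/output/my-app")
             .option("checkpointLocation", "file:///mnt/checkpoints/my-app")
             .start())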
0 votes, 0 answers

How to add another column in a spark streaming dataframe whose value depends upon the value of another column

So I have a streaming source, say adClick, with a timestamp column named "adClick_time", and now I want to calculate the difference between this timestamp and the current time. Since it is a streaming source, as the data arrive the current time will keep…
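
A sketch under the excerpt's assumptions (a streaming adClickDF with a timestamp column adClick_time): derive the lag purely from columns, so it is recomputed for every micro-batch. Note that current_timestamp() is evaluated when each batch is planned:

    from pyspark.sql import functions as F

    # delay_seconds = wall-clock time at batch planning minus the event time.
    with_delay = (adClickDF
                  .withColumn("current_time", F.current_timestamp())
                  .withColumn("delay_seconds",
                              F.unix_timestamp("current_time")
                              - F.unix_timestamp("adClick_time")))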
0 votes, 0 answers

In a Java application, Dataframe.foreach() is not able to fetch rows when spark.master=yarn is used

I have a Java Spark application, which I run using the spark-submit command. My specific use case involves fetching offset data from Kafka using the Spark consumer into a Dataframe (Dataset) and performing transformations (Spark SQL) on each row of this…
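
The usual explanation for this symptom is that foreach() runs on the executors, so rows never reach driver-side code under yarn; local mode merely masks that. A hedged sketch of the common workaround, foreachBatch, which hands each micro-batch to the driver as a plain DataFrame (shown in Python as the equivalent of the Java pattern):

    def process_batch(batch_df, batch_id):
        # batch_df is a normal DataFrame here, so Spark SQL transformations
        # and driver-side actions such as collect() are legal.
        for row in batch_df.collect():   # acceptable for small batches only
            print(batch_id, row)

    query = (df.writeStream
             .foreachBatch(process_batch)
             .option("checkpointLocation", "/tmp/chk")  # placeholder path
             .start())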
0 votes, 1 answer

Spark Structured Streaming: Writing DataFrame as CSV fails because of a missing watermark

I am using Spark 3.4.1, PySpark 3.4.1, and Python 3.11. Using Spark Structured Streaming I want to write a DataFrame as a CSV file. logsDF is a pyspark.sql.dataframe.DataFrame with the schema: root |-- timestampLog: string…
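
The likely shape of the answer, sketched: the file sinks (including CSV) support only append mode, and an aggregated streaming DataFrame can be appended only once a watermark tells Spark when a window is final. Window length and paths below are assumptions:

    from pyspark.sql import functions as F

    agg = (logsDF
           .withColumn("ts", F.to_timestamp("timestampLog"))  # string -> timestamp
           .withWatermark("ts", "10 minutes")                 # tolerate 10 min lateness
           .groupBy(F.window("ts", "5 minutes"))
           .count())

    query = (agg.writeStream
             .outputMode("append")              # the only mode the file sink accepts
             .format("csv")
             .option("path", "/tmp/out")        # placeholder
             .option("checkpointLocation", "/tmp/chk")
             .start())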
0 votes, 1 answer

PySpark: understanding how topic offsets are committed in Kafka Structured Streaming

I am working with Spark for the first time and I read the docs here, but I am still struggling to understand what the maxOffsetsPerTrigger setting does. Specifically, what does "trigger interval" mean here? I suspect that this is a Spark setting…
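
Sketching the two settings the question asks about: the trigger interval is how often a new micro-batch starts, and maxOffsetsPerTrigger caps how many Kafka offsets that batch may consume. Broker, topic, and the numbers are placeholders, and spark is assumed to be an existing SparkSession:

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("maxOffsetsPerTrigger", 100000)   # at most 100k offsets per batch
          .load())

    query = (df.writeStream
             .format("console")
             .trigger(processingTime="30 seconds")  # the trigger interval
             .option("checkpointLocation", "/tmp/chk")
             .start())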
0 votes, 0 answers

How to calculate the average for the last "n" days in Spark Structured Streaming

I am looking at computing the average steps taken by the user for the "last 7 days". The device can push multiple records for the total steps taken along with the timestamp. The indicative schema is like below: id…
asked by Sharath Chandra
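
A hedged sketch of one standard approach: a sliding event-time window of 7 days that advances by 1 day yields a per-day "last 7 days" average. The column names and the watermark threshold are assumptions based on the excerpt:

    from pyspark.sql import functions as F

    avg_steps = (stepsDF                            # hypothetical streaming source
                 .withWatermark("ts", "1 day")      # how late records may arrive
                 .groupBy(F.window("ts", "7 days", "1 day"), "id")
                 .agg(F.avg("steps").alias("avg_steps")))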
0 votes, 0 answers

Cleaning checkpoint state files of stateful Spark Structured Streaming

I am struggling to find a solution for cleaning the checkpoint state files, whose number grows over time after I start a stateful Structured Streaming query and which end up taking a lot of disk space. By checkpoint state file I mean the delta and…
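
One knob that exists for exactly this: Spark retains state files for a configurable number of recent micro-batches. A sketch, with the value chosen arbitrarily:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("state-cleanup")
             # Default is 100; fewer retained batches means fewer delta/snapshot
             # files kept in the state store checkpoint directory.
             .config("spark.sql.streaming.minBatchesToRetain", "10")
             .getOrCreate())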
0 votes, 0 answers

Is there any example of connecting from Spark Structured Streaming to Kafka using the SASL OAUTHBEARER mechanism?

I have to connect to Spring Kafka from a Spark Structured Streaming program using the SASL_SSL/OAUTHBEARER mechanism. I will receive the token from an in-house ADFS portal. I have been trying to find an example or a detailed description of how this integration…
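
A hedged sketch of how the Kafka client settings are usually passed through (each consumer property gets a kafka. prefix); the login callback handler class that actually fetches the ADFS token is a placeholder you would have to supply:

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9093")   # placeholder
          .option("subscribe", "events")
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.sasl.mechanism", "OAUTHBEARER")
          .option("kafka.sasl.jaas.config",
                  "org.apache.kafka.common.security.oauthbearer."
                  "OAuthBearerLoginModule required;")
          # Hypothetical handler that obtains the token from the ADFS portal.
          .option("kafka.sasl.login.callback.handler.class",
                  "com.example.AdfsTokenCallbackHandler")
          .load())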
0 votes, 1 answer

Multiple aggregation in Spark Structured Streaming

I'm building a data pipeline using Spark Structured Streaming, which reads data from Kafka. Here is my source code: queries = [] plug_df = event_df.withWatermark('timestamp', '10 minutes').groupby( f.window(f.col('timestamp'), '5 minutes', '5…
asked by Anh Duc Ng
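
For context on this class of question: a single streaming query could not chain multiple aggregations before Spark 3.4, so the usual workaround is to run independent queries from the same source (or use foreachBatch). A sketch, where house_df stands in for a hypothetical second aggregation alongside the excerpt's plug_df:

    plug_q = (plug_df.writeStream
              .option("checkpointLocation", "/tmp/chk/plug")   # one per query
              .format("console").start())
    house_q = (house_df.writeStream
               .option("checkpointLocation", "/tmp/chk/house")
               .format("console").start())

    # Block until any of the running queries stops.
    spark.streams.awaitAnyTermination()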
0 votes, 0 answers

Unable to authenticate to Schema Registry with basic auth and TLS certificate

I want to read from a Kafka topic hosted on an on-prem cluster and write to a GCP bucket. The Schema Registry requires both basic auth and TLS authentication. I tried the code below but it doesn't seem to work. object KafkaConnection { val brokers =…
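
For the Schema Registry side, the Confluent client accepts basic auth and TLS settings together. The question's code is Scala, but the same configuration keys exist in the confluent-kafka Python client; a sketch with placeholder values:

    from confluent_kafka.schema_registry import SchemaRegistryClient

    sr = SchemaRegistryClient({
        "url": "https://schema-registry.example.com:8081",   # placeholder
        "basic.auth.user.info": "user:password",             # basic auth
        "ssl.ca.location": "/etc/ssl/ca.pem",                 # TLS trust
        "ssl.certificate.location": "/etc/ssl/client.pem",    # client cert
        "ssl.key.location": "/etc/ssl/client.key",
    })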
0 votes, 1 answer

I am facing "java.lang.OutOfMemoryError: GC overhead limit exceeded" while working with Spark Structured Streaming + Kafka

I am working with Spark Structured Streaming, taking around 10M records from a Kafka topic, transforming them, and saving them to MySQL. I am facing "java.lang.OutOfMemoryError: GC overhead limit exceeded" with Spark, and I want to limit the number of…
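
A hedged sketch of the usual remedy: bound each micro-batch with maxOffsetsPerTrigger so the 10M records arrive in slices, and write each slice to MySQL with plain JDBC inside foreachBatch. Connection details are placeholders:

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("maxOffsetsPerTrigger", 500000)   # ~0.5M records per batch
          .load())

    def save_to_mysql(batch_df, batch_id):
        (batch_df.write
         .format("jdbc")
         .option("url", "jdbc:mysql://db:3306/mydb")   # placeholder
         .option("dbtable", "events")
         .option("user", "spark")
         .option("password", "***")
         .mode("append")
         .save())

    query = (df.writeStream
             .foreachBatch(save_to_mysql)
             .option("checkpointLocation", "/tmp/chk")
             .start())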
0 votes, 0 answers

Spark streaming to Kafka giving java.lang.UnsatisfiedLinkError

Writing a test program to stream file data to a Kafka topic using Spark 3.4.0: object SparkTest { def main(args: Array[String]): Unit = { val schema = StructType(List(StructField("id", StringType,…
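
On Windows this particular error is commonly the missing Hadoop native layer (winutils.exe/hadoop.dll under HADOOP_HOME) rather than the streaming code itself; that is an assumption about this question, not a confirmed diagnosis. For completeness, a PySpark sketch of the file-to-Kafka write the program attempts; Kafka requires the outgoing value column to be string or binary, and fileDF is a hypothetical file-source stream:

    from pyspark.sql import functions as F

    query = (fileDF
             .select(F.to_json(F.struct("*")).alias("value"))
             .writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("topic", "filedata")                 # placeholder topic
             .option("checkpointLocation", "/tmp/chk")
             .start())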
0 votes, 1 answer

Apache Spark with watermark - processing data of different LogTypes in the same Kafka topic

I'm using Apache Spark Structured Streaming to read data from a Kafka topic and do some processing. I'm using a watermark to account for late-arriving records, and the code works fine. Here is the working (sample) code: from pyspark.sql import…
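
A hedged sketch of how such questions are usually resolved: a watermark applies to a whole streaming DataFrame, so filtering the topic into per-LogType streams lets each carry its own lateness tolerance. The logType column, ts column, and thresholds are assumptions:

    from pyspark.sql import functions as F

    type_a = (parsed_df.filter(F.col("logType") == "A")
              .withWatermark("ts", "10 minutes"))
    type_b = (parsed_df.filter(F.col("logType") == "B")
              .withWatermark("ts", "1 hour"))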
0 votes, 1 answer

Issue with Glue script reading Kafka events based on timestamp

I'm facing an issue with my Glue script that reads events from Kafka. Currently, I'm using Spark Structured Streaming and the script reads events starting from the earliest offset. However, I would like to modify it to read events based on a…
asked by Smaillns
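
The Spark 3.x Kafka source has an option aimed at exactly this: startingOffsetsByTimestamp, a JSON map of topic to per-partition epoch-millisecond timestamps, which Kafka resolves to the first offset at or after each timestamp. A sketch with placeholder topic, partitions, and timestamps:

    import json

    starting = json.dumps({"events": {"0": 1700000000000, "1": 1700000000000}})

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("startingOffsetsByTimestamp", starting)
          .load())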