Questions tagged [spark-kafka-integration]

Use this tag for any Spark-Kafka integration. This tag should be used for both batch and stream processing while also covering Spark Streaming (DStreams) and Structured Streaming.

This tag is related to the spark-streaming-kafka and spark-sql-kafka libraries.

External sources:

- Spark Streaming + Kafka Integration Guide (official Apache Spark documentation)
- Structured Streaming + Kafka Integration Guide (official Apache Spark documentation)

To make your question more precise, consider adding your code, the Spark and Kafka versions you use, and the full error message or stack trace.
This tag serves as a synonym for the existing (low-traffic) spark-streaming-kafka tag, which focuses only on Spark Streaming (neither batch nor Structured Streaming).

96 questions
1
vote
1 answer

Rewind and reconsume offset in structured streaming from Kafka

Is there a way we can rewind the offsets in Structured Streaming? I am using Spark version 3 and have configured startingOffsets as earliest, and every restart after that picks up the offsets from the checkpoint directory. For example:…
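A minimal sketch of one common approach (topic name, offsets, and paths are assumptions): startingOffsets only applies on the very first run of a query, so rewinding usually means pointing the query at a fresh checkpoint directory and passing explicit per-partition offsets.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rewind-example").getOrCreate()

// startingOffsets takes a JSON map of topic -> partition -> offset;
// -2 means earliest, -1 means latest.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", """{"events":{"0":42,"1":-2}}""")
  .load()

df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/rewind-v2") // fresh directory, so the old checkpoint no longer wins
  .start()
```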
1
vote
2 answers

Send data to Kafka topics based on a condition in Dataframe

I want to change the destination Kafka topic depending on the value of the data in Spark Streaming. Is it possible to do so? When I tried the following code, it only executes the first one, but does not execute the lower…
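The Kafka sink reads the target topic from a per-row "topic" column when no topic option is set on the writer; a sketch of that pattern (column and topic names are assumptions):

```scala
import org.apache.spark.sql.functions.{col, lit, when}

// Route each row to a topic via a computed "topic" column.
val routed = df
  .withColumn("topic",
    when(col("status") === "error", lit("errors-topic"))
      .otherwise(lit("events-topic")))
  .selectExpr("topic", "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

routed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/checkpoints/routing")
  .start()
```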
1
vote
1 answer

Writing Spark DataFrame to Kafka is ignoring the partition column and kafka.partitioner.class

I am trying to write a Spark DataFrame (batch DF) to Kafka and I need to write the data to specific partitions. I tried the following code: myDF.write .format("kafka") .option("kafka.bootstrap.servers", kafkaProps.getBootstrapServers) …
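Recent Spark versions document an optional integer "partition" column on the Kafka sink; a sketch assuming that support is available in the version at hand (the topic name and partition value are assumptions):

```scala
import org.apache.spark.sql.functions.lit

myDF
  .withColumn("partition", lit(3)) // fixed target partition per row
  .selectExpr("CAST(key AS STRING) AS key",
              "CAST(value AS STRING) AS value",
              "partition")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaProps.getBootstrapServers)
  .option("topic", "my-topic")
  .save()
```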
1
vote
1 answer

kafka-consumer-groups command doesn't show LAG and CURRENT-OFFSET for Spark Structured Streaming applications (consumers)

I have a Spark Structured Streaming application consuming from Kafka, and I would like to monitor the consumer lag for it. I'm using the below command to check consumer lag; however, I don't get the CURRENT-OFFSET, and hence LAG is blank too.…
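This behavior is expected insofar as Structured Streaming tracks offsets in its checkpoint rather than committing them to Kafka, so the CLI has nothing to report. A hedged sketch of monitoring progress from inside the application instead:

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Progress events carry per-source startOffset/endOffset for the Kafka source.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(event.progress.json)
})
```

Note that setting option("kafka.group.id", ...) (Spark 3.0+) pins a stable group id, but Spark still does not commit offsets to it, so kafka-consumer-groups keeps showing blanks either way.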
1
vote
0 answers

Logging with Spark/Kafka stream processing application

I'm new to working in Scala with the Spark and Kafka integration, and I'm running into an issue with logging. I have tried many different logging libraries, but they all return the same error from Spark. The error is the following: Exception in…
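The error text is truncated above, but a frequent cause when logging inside Spark jobs is capturing a non-serializable logger in a task closure. A sketch of the usual workaround, assuming that is the failure here:

```scala
import org.apache.log4j.Logger

object StreamJob extends Serializable {
  // @transient lazy: the logger is not shipped with the closure and is
  // re-created on each executor instead.
  @transient lazy val log: Logger = Logger.getLogger(getClass.getName)

  def process(value: String): String = {
    log.info(s"processing record: $value")
    value.toUpperCase
  }
}
```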
1
vote
3 answers

Spark 3 structured streaming use maxOffsetsPerTrigger in Kafka source with Trigger.Once

We need to use maxOffsetsPerTrigger in the Kafka source with Trigger.Once() in Structured Streaming, but based on this issue it seems Spark 3 reads allAvailable. Is there a way to achieve rate limiting in this situation? Here is a sample code in…
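On Spark 3.3+, Trigger.AvailableNow() is designed for exactly this case: it drains all available data like Trigger.Once() but in multiple micro-batches that respect maxOffsetsPerTrigger. A sketch (topic and paths are assumptions):

```scala
import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000") // cap per micro-batch
  .load()

stream.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/checkpoints/drain")
  .trigger(Trigger.AvailableNow())
  .start()
```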
0
votes
1 answer

How to add Kafka dependencies for PySpark on a Jupyter notebook

I have set up Kafka 2.1 on Windows and am able to successfully communicate a topic from producer to consumer over localhost:9092. I now want to consume this in a Spark structured stream. For this I set up Spark 3.4 and installed PySpark over Jupyter…
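The Kafka source is not bundled with Spark itself; one way is to pull it in via the spark.jars.packages config before the session starts (the config key is the same in PySpark). The version coordinates below are assumptions and must match the Spark/Scala build in use:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("kafka-notebook")
  // Maven coordinates: the artifact's Scala and Spark versions must match the runtime.
  .config("spark.jars.packages",
          "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1")
  .getOrCreate()
```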
0
votes
0 answers

Spark Kafka: understanding offset management with enable.auto.commit

According to the Kafka documentation, offsets in Kafka can be managed using enable.auto.commit and auto.commit.interval.ms. I have difficulties understanding the concept. For example, I have a Kafka topic that shall be batch-loaded every day and shall only load…
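For context, Spark's Kafka source ignores enable.auto.commit entirely and manages offsets itself; for a batch read the range is bounded explicitly. A sketch (topic name is an assumption):

```scala
val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "daily-load")
  .option("startingOffsets", "earliest") // or a JSON map of explicit offsets
  .option("endingOffsets", "latest")
  .load()
```

To pick up only new data on each daily run, the application would persist the last read offsets itself and pass them back as the next run's startingOffsets.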
0
votes
0 answers

Spark Streaming: How to handle failure in Spark when connecting to multiple Kafka clusters via a union of DStreams?

I have a requirement where I have to read from multiple Kafka clusters (more than 20 clusters) via Spark Streaming. I am able to read all of them by creating a Kafka direct stream per cluster and performing a union on…
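A sketch of the union pattern for reference (broker lists, topic, and the existing StreamingContext ssc are assumptions); note that a failure in any one input stream stops the whole StreamingContext, so isolating clusters generally means running separate applications:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val clusters = Seq("broker-a:9092", "broker-b:9092") // one entry per Kafka cluster
val streams = clusters.map { brokers =>
  val params = Map[String, Object](
    "bootstrap.servers"  -> brokers,
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id"           -> "multi-cluster-reader")
  KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("events"), params))
}
val unioned = ssc.union(streams) // single DStream over all clusters
```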
0
votes
1 answer

Running Kafka and Spark with docker-compose

My goal is to send/produce a txt file from my Windows PC to a container running Kafka, to then be consumed by PySpark (running in another container). I'm using docker-compose, where I define a custom network and several containers, such as: spark-master, two…
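Inside the compose network, containers must address the broker by its service name rather than localhost, and the broker's advertised listener has to return that same name. A sketch of the consumer side (service and topic names are assumptions):

```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092") // compose service name, not localhost
  .option("subscribe", "txt-lines")
  .load()
```

The producer on the Windows host typically needs a second listener advertised as localhost on a published port, so host and containers each resolve an address that works from their side of the network.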
0
votes
1 answer

How to map a message to an object with `schema` and `payload` in Spark structured streaming correctly?

I am hoping to map a message to an object with schema and payload fields inside it during Spark structured streaming. This is my original code: val input_schema = new StructType() .add("timestamp", DoubleType) .add("current", DoubleType) …
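One way is to wrap the payload schema in an outer struct and parse with from_json; a sketch building on the excerpt's input_schema (kafkaDF and the string-typed schema field are assumptions):

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val input_schema = new StructType()
  .add("timestamp", DoubleType)
  .add("current", DoubleType)

// Outer envelope: "schema" is assumed to arrive as a raw JSON string here.
val envelope = new StructType()
  .add("schema", StringType)
  .add("payload", input_schema)

val parsed = kafkaDF
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), envelope).as("msg"))
  .select("msg.payload.*") // flatten the payload fields into columns
```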
0
votes
1 answer

Getting an error for org.apache.spark.sql.Encoder and "missing or invalid dependency detected while loading class file" for SQLImplicits, LowPrioritySQLImplicits

I am running the following code to read a Kafka stream with Spark 3.2.2 and Scala 2.12.0. Earlier the same code was working fine with Spark 2.2 and Scala 2.11.8. import spark.implicits._ val kafkaStream = spark .readStream .format("kafka") …
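This compiler message usually points to mismatched Scala or Spark binary versions on the classpath. A build.sbt sketch of the usual fix (exact versions are assumptions; they just need to agree with each other and with the cluster):

```scala
// Spark 3.2.x is built against a late Scala 2.12 patch release, so pin
// the compiler accordingly and keep all Spark artifacts on one version.
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "3.2.2" % Provided,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.2.2"
)
```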
0
votes
0 answers

What is the best way to set up a Kafka connection with Apache Spark

How can we make the Kafka stream more stable, so that it runs constantly without us having to start the run again after it fails? (So far we are thinking about using the "continuous" run mode to make it automatically start a new run even after a…
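Besides the continuous trigger, a common pattern is simply supervising the query and restarting it on failure, letting the checkpoint resume from the last committed offsets. A minimal sketch (df, topic, and paths are assumptions):

```scala
// Restart loop: each iteration resumes from the checkpoint.
while (true) {
  val query = df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "out")
    .option("checkpointLocation", "/tmp/checkpoints/stable")
    .start()
  try query.awaitTermination()
  catch { case e: Exception => println(s"query failed, restarting: ${e.getMessage}") }
}
```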
0
votes
0 answers

Spark Kafka error while publishing data to a Kafka topic

I am getting the below error while publishing (writeStream) DataFrame data to a Kafka topic. Can you please guide me here?
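The actual error is not shown in the excerpt, but for reference, the Kafka sink requires a value column castable to string or binary plus a checkpoint location, and omitting either fails at start(). A minimal working shape (column and topic names are assumptions):

```scala
df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoints/publish")
  .start()
```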
0
votes
0 answers

How to read a DF column of struct type and add the key-value pairs to Kafka headers?

I have a new DataFrame with 2 columns: one is headers and the other is a payload. I am facing issues reading the headers column and assigning the values to Kafka headers while publishing. Earlier the DataFrame had 4 columns, as in the old DF schema: Id -…
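The Kafka sink (Spark 3.0+) writes an optional headers column typed array&lt;struct&lt;key: string, value: binary&gt;&gt;; a sketch converting a struct column into that shape (column and field names are assumptions):

```scala
import org.apache.spark.sql.functions.expr

val out = df.select(
  expr("CAST(payload AS STRING) AS value"),
  // One named_struct per header entry; adapt the field list to the real struct.
  expr("""array(
            named_struct('key', 'source',  'value', CAST(headers.source  AS BINARY)),
            named_struct('key', 'traceId', 'value', CAST(headers.traceId AS BINARY))
          ) AS headers"""))

out.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "with-headers")
  .save()
```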