Questions tagged [spark-kafka-integration]

Use this tag for any question about Spark-Kafka integration. It covers both batch and stream processing, including Spark Streaming (DStreams) and Structured Streaming.

This tag is related to the spark-streaming-kafka and spark-sql-kafka libraries.

This tag serves as a synonym for the existing (low-traffic) tag, which focuses only on Spark Streaming (not batch and not Structured Streaming).
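For orientation, a minimal, hedged sketch of reading a Kafka topic with Structured Streaming via the spark-sql-kafka data source, assuming a SparkSession named spark is already in scope (as in spark-shell); the broker address, topic name, and checkpoint path are placeholders.

    // Minimal Structured Streaming read from Kafka (placeholders for broker/topic).
    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")   // placeholder broker
      .option("subscribe", "my-topic")                    // placeholder topic
      .load()

    // Kafka exposes key/value as binary; cast to STRING before printing.
    val query = kafkaDf
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/tag-example")
      .start()

    query.awaitTermination()

The same source can be read in batch mode by using spark.read instead of spark.readStream.
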

96 questions
1
vote
1 answer

PySpark Kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/security/JaasContext

I am encountering a problem with printing data to the console from a Kafka topic. The error message I get is shown in the image below. 22/09/06 10:14:02 ERROR MicroBatchExecution: Query [id = ba6cb0ca-a3b1-41be-9551-7956650fbdab, runId =…
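JaasContext lives in the kafka-clients jar, so this error usually means the Kafka client classes are missing from the driver/executor classpath. A hedged build.sbt sketch (versions are illustrative, not taken from the question) that pulls them in transitively via spark-sql-kafka:

    // build.sbt sketch; align the versions with your Spark distribution.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"            % "3.3.0" % "provided",
      // Brings org.apache.kafka:kafka-clients (which contains JaasContext) transitively.
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.3.0"
    )

Submitting with --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<your Spark version> has a similar effect at runtime.
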
1
vote
0 answers

Spark Streaming Kafka consumer auto close

I don't want to use one consumer for all topics; I want to use this method to improve consumption efficiency: val kafkaParams = Map( ConsumerConfig.GROUP_ID_CONFIG -> group, ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers, …
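A hedged sketch of the per-topic consumer idea with the spark-streaming-kafka-0-10 direct stream API, assuming a SparkSession named spark; the broker address, group ids, and topic names are placeholders.

    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

    // One direct stream per topic, each with its own consumer group (placeholders).
    val topics = Seq("topicA", "topicB")
    topics.foreach { topic =>
      val kafkaParams: Map[String, Object] = Map(
        ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG        -> "host1:9092",
        ConsumerConfig.GROUP_ID_CONFIG                 -> s"group-$topic",
        ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG   -> classOf[StringDeserializer],
        ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
      )
      val stream = KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams)
      )
      stream.map(_.value()).print()
    }

    ssc.start()
    ssc.awaitTermination()
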
1
vote
3 answers

NoSuchMethodError: org.apache.spark.sql.kafka010.consumer

I am using Spark Structured Streaming to read messages from multiple topics in Kafka. I am facing the error below: java.lang.NoSuchMethodError:…
1
vote
0 answers

Spark Structured Streaming inconsistent output to multiple sinks

I am using Spark Structured Streaming to read data from Kafka and apply a UDF to the dataset. The code is as below: calludf = F.udf(lambda x: function_name(x)) dfraw = spark.readStream.format('kafka') \ .option('kafka.bootstrap.servers',…
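One common way to keep several sinks consistent is to write each micro-batch once via foreachBatch and persist it before the individual writes. A hedged sketch, assuming a SparkSession named spark; the UDF, topic, paths, and checkpoint location are placeholders standing in for the question's code.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    val myUdf = udf((s: String) => s.toUpperCase)   // placeholder for function_name

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "input-topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .withColumn("value", myUdf(col("value")))

    // foreachBatch writes the same persisted micro-batch to both sinks,
    // so they cannot diverge the way two independent streaming queries can.
    val query = parsed.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/multi-sink")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.persist()
        batch.write.mode("append").parquet("/tmp/sink-a")
        batch.write.mode("append").parquet("/tmp/sink-b")
        batch.unpersist()
        ()
      }
      .start()

    query.awaitTermination()
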
1
vote
2 answers

How to run Spark structured streaming using local JAR files

I'm using one of the Docker images of EMR on EKS (emr-6.5.0:20211119) and investigating how to work with Kafka using Spark Structured Streaming (pyspark). As per the integration guide, I run a Python script as follows. $SPARK_HOME/bin/spark-submit…
1
vote
1 answer

Structured Streaming startingOffsets and Checkpoint

I am confused about startingOffsets in Structured Streaming. In the official docs here, it says query type Streaming - is this continuous streaming? Batch - is this for queries with foreachBatch or triggers? (latest is not allowed) My workflow also…
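For reference, a hedged sketch of how these options interact, assuming a SparkSession named spark: startingOffsets only takes effect on the first run of a query; once the checkpointLocation has been written, later runs resume from the checkpointed offsets and ignore it. In the docs' table, "streaming query" means a readStream query and "batch query" means a spark.read query against Kafka, where startingOffsets defaults to earliest and latest is not allowed. Broker, topic, and paths below are placeholders.

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")   // placeholder
      .option("subscribe", "my-topic")                    // placeholder
      // "earliest" or "latest"; applied only when no checkpoint exists yet.
      .option("startingOffsets", "earliest")
      .load()

    stream.writeStream
      .format("console")
      // On restart, offsets come from here, not from startingOffsets.
      .option("checkpointLocation", "/tmp/checkpoints/starting-offsets-demo")
      .start()
      .awaitTermination()
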
1
vote
1 answer

Read Kafka messages in a Spark batch job

What is the best option to read, each day, the latest messages from a Kafka topic in a Spark batch job (running on EMR)? I don't want to use Spark Streaming because I don't have a cluster running 24/7. I saw the option of…
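A hedged sketch of a Kafka batch read with spark.read, assuming a SparkSession named spark; broker, topic, and offsets are placeholders. For a daily job you would persist the ending offsets somewhere and pass them back as startingOffsets on the next run.

    // Batch (non-streaming) read of a Kafka topic.
    val batchDf = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "my-topic")
      // For batch queries startingOffsets defaults to "earliest" and
      // endingOffsets to "latest"; per-partition JSON offsets are also accepted.
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    batchDf.selectExpr("CAST(value AS STRING)").show(truncate = false)
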
1
vote
0 answers

How to improve performance of Kafka (producer and consumer) when installed on a single PC

I am working on a video analytics project where I have to detect 5 kinds of objects from 10 CCTV cameras. The customer provided only one Ubuntu PC to deploy my video analytics engine. Now, I have to install all of my…
1
vote
1 answer

PySpark Kafka - NoClassDefFound: org/apache/commons/pool2

I am encountering a problem with printing data to the console from a Kafka topic. The error message I get is shown in the image below. As you can see in the image above, after batch 0 it doesn't process further. All of these are snapshots of the error…
1
vote
1 answer

Spark Scala: Failed to find data source: kafka

I'm trying to use an example from sparkByExamples.com, but for some reason the Spark program doesn't read data from the Kafka topic. Code is here. Error msg below: Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data…
1
vote
0 answers

How to limit Spark consuming data from Kafka by time

I am new to Spark. I have a Spark Streaming batch job (maybe it should be Structured Streaming) which receives data from Kafka hourly. I found that my Spark job keeps consuming data and will not stop, so I want to control it. For example, now it is 3 am,…
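One hedged way to bound each run (assuming Spark 3.3+ and a SparkSession named spark): cap the read rate with maxOffsetsPerTrigger and use Trigger.AvailableNow so the hourly job processes what is currently available and then stops; the broker, topic, paths, and numbers are placeholders.

    import org.apache.spark.sql.streaming.Trigger

    val limited = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "my-topic")
      .option("maxOffsetsPerTrigger", "100000")   // upper bound per micro-batch
      .load()

    limited.writeStream
      .format("parquet")
      .option("path", "/tmp/hourly-output")
      .option("checkpointLocation", "/tmp/checkpoints/hourly")
      // Processes everything currently available (in rate-limited batches), then stops.
      .trigger(Trigger.AvailableNow())
      .start()
      .awaitTermination()
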
1
vote
0 answers

Solved - Unable to set TLS parameters for Kafka connection from spark

I am having a problem setting the required parameters to connect to Kafka from Spark using TLS. This is my current approach: spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ":tls port") …
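For comparison, a hedged sketch of passing TLS settings through the Kafka source, assuming a SparkSession named spark: any Kafka consumer property can be set by prefixing it with kafka.; the broker address, file paths, and passwords below are placeholders.

    val tlsStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")                     // placeholder
      .option("subscribe", "my-topic")                                      // placeholder
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/etc/kafka/truststore.jks")
      .option("kafka.ssl.truststore.password", "changeit")
      // Only needed if the broker requires mutual TLS (client authentication):
      .option("kafka.ssl.keystore.location", "/etc/kafka/keystore.jks")
      .option("kafka.ssl.keystore.password", "changeit")
      .load()
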
1
vote
1 answer

What is the best way to perform multiple filter operations on a Spark streaming DataFrame read from Kafka?

I need to apply multiple filters on a DataFrame read from a Kafka topic and publish the output of each of these filters to an external system (like another Kafka topic). I have read the kafkaDF like this: val kafkaDF: DataFrame = spark.readStream …
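A hedged sketch of one approach, assuming a SparkSession named spark and a streaming kafkaDF as in the question: start one writeStream query per filter, each with its own checkpoint, so each query tracks its own Kafka offsets. Topic names, predicates, and paths are placeholders.

    import org.apache.spark.sql.functions.col

    val values = kafkaDF.selectExpr("CAST(value AS STRING) AS value")

    val queryA = values.filter(col("value").contains("typeA"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("topic", "topic-a")
      .option("checkpointLocation", "/tmp/checkpoints/filter-a")
      .start()

    val queryB = values.filter(col("value").contains("typeB"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("topic", "topic-b")
      .option("checkpointLocation", "/tmp/checkpoints/filter-b")
      .start()

    spark.streams.awaitAnyTermination()

Each query re-reads the source independently; foreachBatch with a persisted micro-batch is an alternative if reading Kafka once per trigger matters.
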
1
vote
2 answers

Spark Structured Streaming StreamingQueryListener.onQueryProgress not called per microbatch?

I'm using Spark 3.0.2 and I have a streaming job that consumes data from Kafka with a trigger duration of "1 minute". I see in the Spark UI that there is a new job every 1 minute as defined, but the method onQueryProgress is being called every 5~6…
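For context, a hedged sketch of registering a StreamingQueryListener, assuming a SparkSession named spark. onQueryProgress is reported per completed micro-batch, but triggers that find no new data are reported on a separate schedule, so the callback frequency need not match the trigger interval exactly.

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

    class LoggingListener extends StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"query started: ${event.id}")

      // Called with the progress of each completed micro-batch.
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        println(s"batch ${event.progress.batchId}: ${event.progress.numInputRows} input rows")

      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"query terminated: ${event.id}")
    }

    spark.streams.addListener(new LoggingListener)
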
1
vote
1 answer

How to avoid queuing up of batches in Spark Streaming

I have Spark Streaming with the direct stream approach and I am using the config below: batch interval 60s, spark.streaming.kafka.maxRatePerPartition 42, auto.offset.reset earliest. As I am starting the streaming batch with the earliest option, to consume the messages…
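A hedged sketch of the rate-control settings this question is about, for the DStream direct approach; the numbers are placeholders to tune, not recommendations.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("rate-limited-direct-stream")
      // Caps records read per partition per second, so a start from "earliest"
      // cannot pull the whole backlog into the first few batches.
      .set("spark.streaming.kafka.maxRatePerPartition", "42")
      // Lets Spark lower the ingestion rate automatically when batch processing
      // time exceeds the batch interval, which is what makes batches queue up.
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(60))   // 60s batch interval
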