Questions tagged [spark-kafka-integration]

Use this tag for any Spark-Kafka integration question. It applies to both batch and stream processing, and covers both Spark Streaming (DStreams) and Structured Streaming.

This tag is related to the spark-streaming-kafka and spark-sql-kafka libraries.
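Many questions under this tag trace back to a mismatch between the connector artifact and the local Spark build. As a rough illustration (the helper function is made up for this sketch, but the Maven coordinate format is the one Spark's Kafka connectors actually use), the `--packages` coordinate can be assembled like this:

```python
def kafka_package(spark_version: str, scala_version: str, structured: bool = True) -> str:
    """Build the --packages Maven coordinate for Spark's Kafka connector.

    The Scala suffix and the artifact version must match the running
    Spark installation, or class-loading errors surface at runtime.
    """
    # Structured Streaming uses spark-sql-kafka; DStreams use spark-streaming-kafka.
    artifact = "spark-sql-kafka-0-10" if structured else "spark-streaming-kafka-0-10"
    return f"org.apache.spark:{artifact}_{scala_version}:{spark_version}"

print(kafka_package("3.3.1", "2.12"))
# org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1
```

The suffix after the underscore is the Scala version Spark was built with, not a Kafka version; getting it wrong is a frequent cause of the `ClassNotFoundException` and `NoSuchMethodError` questions below.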

External sources:

To make your question more precise, consider adding

This tag serves as a synonym for the existing (low-traffic) tag, which only covers Spark Streaming (not batch processing and not Structured Streaming).

96 questions
0
votes
1 answer

Pyspark : Kafka consumer for multiple topics

I have a list of topics (currently 10) whose size can increase in the future. I know we can spawn multiple threads (one per topic) to consume from each topic, but in my case, if the number of topics increases, then the number of threads consuming from the…
erك
  • 3
  • 1
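On the multi-topic question above: a single Structured Streaming reader can consume many topics at once, so per-topic threads are unnecessary. Spark's Kafka source accepts a comma-separated list in the `subscribe` option (or a regex via `subscribePattern`). A small sketch of building that option value (the topic names here are hypothetical):

```python
# Hypothetical topic names; a real job would load these from configuration.
topics = ["orders", "payments", "clickstream"]

# Spark's Kafka source takes all topics in one "subscribe" option,
# comma-separated -- one reader handles any number of topics.
subscribe_value = ",".join(topics)
print(subscribe_value)
# orders,payments,clickstream

# The reader itself would then be configured roughly as:
#   spark.readStream.format("kafka")
#        .option("kafka.bootstrap.servers", "broker:9092")
#        .option("subscribe", subscribe_value)
#        .load()
```

Because the list is a single option value, adding topics later is a configuration change, not a code or threading change.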
0
votes
0 answers

I am facing "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate" error while working with pyspark

I am using this tech stack: Spark version 3.3.1, Scala version 2.12.15, Hadoop version 3.3.4, Kafka version 3.3.1. I am trying to get data from a Kafka topic through Spark Structured Streaming, but I am facing the error mentioned above. The code I am using is: For…
0
votes
1 answer

java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaSourceRDDPartition

I am using Spark 2.3.0 and Kafka 1.0.0.3. I have created a Spark read stream: df = spark.readStream. \ format("kafka"). \ option("kafka.bootstrap.servers", "localhost.cluster.com:6667"). \ option("subscribe", "test_topic"). \ …
0
votes
1 answer

How to process the dataframe which was read from Kafka Topic using Spark Streaming

I'm able to stream Twitter data into my Kafka topic via a producer, and when I consume through the default Kafka consumer I can see the tweets as well. But when I try to use Spark Streaming to consume and process this data further, I'm unable…
0
votes
0 answers

Kafka-PySpark: "earliest" or "latest" not getting recognized as offsets

I have a Kafka-PySpark streaming job that reads the topic. The Kafka configuration is given in a kafka.yml file where I have specified startingOffsets: checkpointLocation:…
0
votes
0 answers

Is it possible to avoid a shuffle while getting distinct data from a Spark DataFrame?

Let's assume we have data gathered from Kafka with 3 partitions (from one topic). The Kafka keys and values are presented in the table below: | key | value …
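On the shuffle question above: a global `distinct` in Spark generally needs an exchange, but Spark first shrinks the data with partition-local deduplication, and if the data is already partitioned by the dedup key (as Kafka keys can guarantee), the exchange may be avoided entirely. A toy plain-Python model of that two-phase idea (not Spark code, just an illustration of the principle):

```python
# Toy model of two-phase distinct. Suppose Kafka delivered 3 partitions.
partitions = [["a", "b", "a"], ["b", "c"], ["c"]]

# Phase 1: per-partition dedup -- happens locally, no network shuffle.
local_distinct = [set(p) for p in partitions]

# Phase 2: merge the (already small) partial results -- the only exchange.
global_distinct = set().union(*local_distinct)
print(sorted(global_distinct))
# ['a', 'b', 'c']
```

If every key lives in exactly one partition, phase 2 becomes a no-op per key, which is the case where Spark can skip the shuffle.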
0
votes
1 answer

How to writeStream to a specific Kafka cluster in Azure Databricks? "Topic mytopic not present in metadata after 60000 ms."

I am trying to write data to Kafka with the writeStream method. I have been given the following properties by the source system: topic = 'mytopic' host = "myhost.us-west-1.aws.confluent.cloud:9092" userid = 'myuser' password ='mypassword' Cluster…
0
votes
1 answer

spark-submit --packages returns Error: Missing application resource

I installed .NET for Apache Spark using the following guide: https://learn.microsoft.com/en-us/dotnet/spark/tutorials/get-started?WT.mc_id=dotnet-35129-website&tabs=windows The Hello World example worked. Now I am trying to connect to and read from a Kafka…
Kenci
  • 4,794
  • 15
  • 64
  • 108
0
votes
1 answer

Spark Structured Streaming - Kafka - Missing required configuration "partition.assignment.strategy"

This is my code: import findspark findspark.init() import os os.environ[ "PYSPARK_SUBMIT_ARGS" ] = "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 pyspark-shell" from pyspark.sql import SparkSession from pyspark.sql.types import…
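The `partition.assignment.strategy` error above is commonly traced to a version mismatch between the kafka-clients jar pulled in by the connector and the rest of the stack, so the first thing to verify is that `PYSPARK_SUBMIT_ARGS` names a connector built for the local Spark and Scala versions. A minimal sketch (the 2.11/2.3.1 versions are copied from the question and must match the installed Spark):

```python
import os

# Versions from the question above: Spark 2.3.1 built with Scala 2.11.
# The suffix after the underscore is the Scala version, not a Kafka version.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 pyspark-shell"
)

# This must be set BEFORE the first pyspark import in the process,
# or the JVM is launched without the connector on its classpath.
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Setting the variable after pyspark has already started the JVM has no effect, which is another frequent cause of this class of error.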
0
votes
0 answers

How to pause a Kafka consumer in a Spark Streaming application so that it does not process any data

How can I pause the Kafka consumer so that my application stops processing data? public static void main(String[] args) throws InterruptedException { SparkConf conf = new SparkConf() .setMaster("local[*]") …
0
votes
0 answers

Databricks Kafka Read Not connecting

I'm trying to read data from GCP Kafka through Azure Databricks, but I'm getting the warning below and the notebook simply never completes. Any suggestions? WARN NetworkClient: Consumer groupId Bootstrap broker rack disconnected. Please note I've…
0
votes
2 answers

PySpark - NoClassDefFoundError: kafka/common/TopicAndPartition

I'm running Spark version 2.3.0.2.6.5.1175-1 with Python 3.6.8 on Ambari. While submitting the application I get the following logs in stderr: 22/06/15 12:29:31 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint Exception in…
0
votes
2 answers

How to send pyspark dataframe to kafka topic?

PySpark version: 2.4.7; Kafka version: 2.13_3.2.0. Hi, I am new to PySpark and streaming. I have come across a few resources on the internet, but I still can't figure out how to send a PySpark DataFrame to a Kafka broker. I need…
0
votes
1 answer

Spark Streaming - Kafka - java -jar not working, but running it as a Java application works

I have a simple Java Spark script that basically returns Kafka data: import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.functions; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; public class App { …
Pedro Alves
  • 1,004
  • 1
  • 21
  • 47
0
votes
1 answer

pyspark: how to perform structured streaming using KafkaUtils

I am doing structured streaming using SparkSession.readStream and writing it to a Hive table, but it seems it does not let me use time-based micro-batches, i.e. I need a batch of 5 seconds. All the messages should form a 5-second batch, and the batch…
aiman
  • 1,049
  • 19
  • 57