Questions tagged [spark-kafka-integration]

Use this tag for any Spark-Kafka integration question. It covers both batch and stream processing, including Spark Streaming (DStreams) and Structured Streaming.

This tag is related to the spark-streaming-kafka and spark-sql-kafka libraries.

To make your question more precise, consider including your Spark and Kafka versions, the integration library you use, and a minimal code example.

This tag serves as a synonym for the existing, low-traffic spark-streaming-kafka tag, which covers only Spark Streaming (neither batch nor Structured Streaming).

96 questions
0 votes, 1 answer

Get two different kinds of data from one Kafka topic into two DataFrames

I have a homework assignment like this: use Python to read JSON files in two folders, song_data and log_data. Use Python Kafka to publish a mixture of both song_data and log_data file types into a Kafka topic. Use PySpark to consume data from the above Kafka…
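A common pattern for this kind of task is to read the topic once and split it by a discriminator field. A minimal sketch, assuming each JSON record carries a "type" field (the broker address, topic, and field names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("split-topic").getOrCreate()

# Read the whole topic once (batch); use readStream for the streaming variant.
raw = (spark.read.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
       .option("subscribe", "events")                        # placeholder topic
       .load()
       .selectExpr("CAST(value AS STRING) AS json"))

# Route each record by a discriminator field in the JSON payload.
song_df = raw.filter(get_json_object(col("json"), "$.type") == "song_data")
log_df = raw.filter(get_json_object(col("json"), "$.type") == "log_data")
```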
0 votes, 1 answer

Spark Structured Streaming with tumbling window: delayed and duplicate data

I am attempting to read from a Kafka topic, aggregate some data over a tumbling window, and write that to a sink (I've tried both Kafka and the console). The problems I'm seeing are a long delay between sending data and receiving aggregate…
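The delay described here is often expected watermark behavior: in append mode a window is only emitted once the watermark passes the window's end. A minimal sketch, assuming `events` is a streaming DataFrame with a `timestamp` column:

```python
from pyspark.sql.functions import window, col

agg = (events
       .withWatermark("timestamp", "1 minute")          # tolerate 1 min of lateness
       .groupBy(window(col("timestamp"), "5 minutes"))  # tumbling 5-minute window
       .count())

# In append mode each window is emitted exactly once, after the watermark passes
# its end; "update" emits partial counts sooner but re-emits rows as they change.
query = (agg.writeStream
         .outputMode("append")
         .format("console")
         .start())
```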
0 votes, 1 answer

Spark Structured Streaming from Kafka to Elasticsearch

I want to write a Spark Streaming job from Kafka to Elasticsearch, detecting the schema dynamically while reading it from Kafka. Can you help me do that? I know this can be done in Spark batch processing via the line below. val schema =…
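Structured Streaming cannot infer a JSON schema while the stream runs; a common workaround (sketched below with placeholder connection details) is to infer the schema from a one-off batch read of the same topic and reuse it in the streaming query:

```python
# Batch-read a sample of the topic to let Spark infer the JSON schema once.
sample = (spark.read.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS json"))

schema = spark.read.json(sample.rdd.map(lambda r: r.json)).schema

# The inferred schema can now be passed to from_json in the streaming query.
```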
0 votes, 1 answer

How can I send my Structured Streaming DataFrame to Kafka?

Hello everyone! I'm trying to send my Structured Streaming DataFrame to one of my Kafka topics, detection. This is the schema of the Structured Streaming DataFrame: root |-- timestamp: timestamp (nullable = true) |-- Sigma: string (nullable =…
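The Kafka sink only consumes a value (and optional key) column, so the DataFrame's columns have to be serialized first. A minimal sketch, reusing the topic name from the question (broker address and checkpoint path are placeholders):

```python
from pyspark.sql.functions import to_json, struct

query = (df.select(to_json(struct("*")).alias("value"))  # pack all columns into JSON
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "detection")
         .option("checkpointLocation", "/tmp/checkpoints/detection")  # required by the Kafka sink
         .start())
```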
0 votes, 1 answer

PySpark Kafka readStream: compatible jar version

I am trying to find a compatible version of the jar for PySpark readStream. I have explored many versions but have not been able to find a compatible jar. Please let me know if I am doing anything wrong. My system configurations and used…
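Rather than hunting for individual jars, the usual approach is to let Spark resolve the matching package, whose coordinates must agree with both the Spark version and its Scala build. A sketch assuming Spark 3.1.2 built against Scala 2.12:

```python
from pyspark.sql import SparkSession

# Coordinates follow org.apache.spark:spark-sql-kafka-0-10_<scala>:<spark version>
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2")
         .getOrCreate())
```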
0 votes, 1 answer

Spark Structured Streaming writeStream outputs no data but no error

I have a Structured Streaming job which reads messages from a Kafka topic and then saves them to DBFS. The code is as follows: input_stream = spark.readStream \ .format("kafka") \ .options(**kafka_options) \ .load() \ .transform(create_raw_features) #…
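Two common reasons for a silent, empty sink are that startingOffsets defaults to "latest" (records published before the query started are skipped) and that the query handle is never awaited. A sketch under those assumptions, reusing kafka_options from the question and placeholder DBFS paths:

```python
input_stream = (spark.readStream
                .format("kafka")
                .options(**kafka_options)
                .option("startingOffsets", "earliest")  # default "latest" skips old records
                .load())

query = (input_stream.writeStream
         .format("parquet")
         .option("path", "/dbfs/raw")                     # placeholder path
         .option("checkpointLocation", "/dbfs/checkpoints/raw")
         .start())
query.awaitTermination()  # keep the driver alive so micro-batches actually run
```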
0 votes, 1 answer

Why does Spark Streaming with Kafka always poll(0) before seekToEnd?

Spark version: 2.4.5 Component: Spark Streaming Class: DirectKafkaInputDStream In the class DirectKafkaInputDStream, I am a little confused about why it invokes paranoidPoll before seekToEnd: protected def latestOffsets(): Map[TopicPartition,…
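The reason is a Kafka consumer API constraint rather than a Spark one: for a subscribe()-based consumer, partitions are only assigned inside poll(), and seeking an unassigned consumer does nothing. A sketch of the same pattern with kafka-python (broker, group, and topic names are placeholders):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="demo")
consumer.subscribe(["events"])
consumer.poll(timeout_ms=0)  # drives the group join, so partitions get assigned
consumer.seek_to_end()       # only meaningful once the assignment exists
```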
0 votes, 0 answers

PySpark ERROR Exception in Structured Streaming + Kafka Integration (Topic not present in metadata after 60000ms)

Hadoop 3.3.1 + Spark 3.1.2, OpenJDK 1.8.0, Scala 2.12.14. Libraries: spark-sql-kafka-0-10_2.12-3.1.2.jar, kafka-clients-2.6.0.jar, spark-token-provider-kafka-0-10_2.12-3.1.2.jar. Open PySpark shell: pyspark --master yarn \ --deploy-mode client…
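"Topic not present in metadata" usually means the brokers are unreachable from the Spark nodes or the topic does not exist. A quick driver-side connectivity check with kafka-python (the broker address is a placeholder):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="broker-host:9092")
print(consumer.topics())  # the topic you subscribe to should appear in this set
```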
0 votes, 1 answer

Spark Kafka connection failed: No resolvable bootstrap urls given in bootstrap.servers

I am trying to read a Kafka topic with Spark 3.0.2. I start a spark shell with the following…
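This error typically appears when the option key lacks the kafka. prefix or the host names do not resolve from the Spark nodes. A minimal sketch with placeholder broker addresses:

```python
df = (spark.read.format("kafka")
      # the key must be "kafka.bootstrap.servers", and every host:port pair
      # must resolve from both the driver and the executors
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("subscribe", "events")
      .load())
```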
0 votes, 1 answer

How to convert kafka message value to a particular schema?

I am trying to read data from Kafka topics using PySpark. I want to transform that data into a particular schema, but I am unable to do so. Here is what I have tried: >> df = spark.read.format("kafka").option("kafka.bootstrap.servers",…
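The usual pattern is to cast the raw value bytes to a string and parse them with from_json against an explicit schema. A sketch with placeholder field names:

```python
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Placeholder schema; replace the fields with those of your actual payload.
schema = StructType([StructField("name", StringType()),
                     StructField("age", IntegerType())])

parsed = (df.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))
```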
0 votes, 0 answers

Kafka integration with PySpark Structured Streaming: job stuck at [*] (with Jupyter)

After installing PySpark, testing that it works fine, and adding the right connector for the Kafka integration, I now try to load the data from Kafka on another machine in the same network and start the job; it gets stuck at [*], no error, no…
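start() itself returns immediately, so a notebook cell that stays at [*] is usually blocked in awaitTermination() or silently retrying an unreachable broker. Inspecting the query without blocking is one way to tell the two apart (a sketch):

```python
query = df.writeStream.format("console").start()

print(query.status)        # e.g. "Waiting for data to arrive" vs. "Initializing sources"
print(query.lastProgress)  # stays None until the first micro-batch completes
```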
0 votes, 0 answers

Spark Kafka producer introducing duplicate records during Kafka ingestion

I have written a Spark Kafka producer, which pulls messages from Hive and pushes them into Kafka. Most of the records (messages) get duplicated when we ingest into Kafka, though I do not have any duplicates before pushing into Kafka. I have added…
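Spark's Kafka sink is at-least-once, so any retried task re-sends every record of its partition. One common mitigation, sketched below with hypothetical table and column names, is to attach a stable key so consumers can deduplicate downstream:

```python
from pyspark.sql.functions import to_json, struct, col

(hive_df
 .select(col("record_id").cast("string").alias("key"),  # hypothetical stable key column
         to_json(struct("*")).alias("value"))
 .write.format("kafka")
 .option("kafka.bootstrap.servers", "localhost:9092")
 .option("topic", "ingest")
 .save())
```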
0 votes, 1 answer

How to create a DataFrame inside ForeachWriter[Row]

I have a streaming query that reads from Kafka as the source. I want to perform some logic on each batch that I receive from the stream. Here's how I have done it so far: val streamDF = spark .readStream ... …
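ForeachWriter processes one row at a time and has no SparkSession, so it cannot build DataFrames; the foreachBatch sink hands each micro-batch over as a regular DataFrame instead. A minimal sketch, reusing the streamDF name from the question:

```python
def process_batch(batch_df, batch_id):
    # batch_df is an ordinary DataFrame; any batch transformation works here
    batch_df.groupBy("key").count().show()

query = (streamDF.writeStream
         .foreachBatch(process_batch)
         .start())
```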
0 votes, 1 answer

How to call a method after a Spark Structured Streaming query (Kafka)?

I need to execute some functions based on the values that I receive from topics. I'm currently using ForeachWriter to convert all the topics to a List. Now, I want to pass this List as a parameter to the methods. This is what I have so far: def…
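foreachBatch also fits this case: collecting each micro-batch yields a plain Python list that can be passed to ordinary methods on the driver. A sketch in which my_method stands in for the hypothetical function to call:

```python
def process_batch(batch_df, batch_id):
    rows = batch_df.selectExpr("CAST(value AS STRING) AS value").collect()
    my_method([row.value for row in rows])  # hypothetical driver-side method

query = df.writeStream.foreachBatch(process_batch).start()
```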
0 votes, 1 answer

Spark Kafka Data Consuming Package

I tried to consume my Kafka topic with the code below, as mentioned in the documentation: df = spark \ .readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", "localhost:9092,") \ .option("subscribe", "first_topic") \ …
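The Kafka source is not bundled with Spark, so the matching package has to be supplied; note too that the trailing comma in "localhost:9092," from the excerpt can itself break broker resolution. A sketch assuming a Spark 3.x build with Scala 2.12:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")  # match your Spark version
         .getOrCreate())

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # no trailing comma
      .option("subscribe", "first_topic")
      .load())
```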