Spark Streaming integration for Kafka. The Direct Stream approach provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
Questions tagged [spark-streaming-kafka] (250 questions)
1 vote · 1 answer
Spark Structured Streaming custom partition directory name
I'm porting a streaming job (Kafka topic -> AWS S3 Parquet files) from Kafka Connect to a Spark Structured Streaming job.
I partition my data by year/month/day.
The code is very simple:
df.withColumn("year",…

Vladimir · 101
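By default, Spark's `partitionBy("year", "month", "day")` writes Hive-style directories such as `year=2021/month=05/day=01`; there is no built-in option to rename them. One common workaround, sketched below in plain Python, is to compute the target prefix yourself (for example inside a `foreachBatch` sink) and write each batch to that path. The bucket name and layout are assumptions for illustration, not taken from the question.

```python
from datetime import date

def partition_prefix(d: date, base: str = "s3://bucket/table") -> str:
    # Build a plain date-based prefix (2021/05/01) instead of Spark's
    # default key=value style (year=2021/month=05/day=01).
    return f"{base}/{d.year:04d}/{d.month:02d}/{d.day:02d}"
```

In a `foreachBatch` callback you would then write the batch DataFrame to `partition_prefix(batch_date)` directly, giving up Spark's automatic partition discovery in exchange for the custom directory names.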
1 vote · 1 answer
Issue writing records into MySQL from a Spark Structured Streaming DataFrame
I am using the code below to write a Spark Streaming DataFrame into a MySQL DB. Below are the Kafka topic JSON data format and the MySQL table schema. Column names and types are identical.
But I am unable to see records written to the MySQL table. The table is empty…

Sarvendra Singh · 109
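Spark has no built-in streaming JDBC sink, so the usual pattern is to route each micro-batch through `foreachBatch` and use the batch JDBC writer there. A minimal sketch of that pattern follows; the URL, table name, and credentials are placeholders, not values from the question.

```python
def write_batch_to_mysql(batch_df, batch_id):
    # batch_df is a regular (non-streaming) DataFrame inside foreachBatch,
    # so the ordinary JDBC writer applies.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/mydb")  # placeholder URL
        .option("dbtable", "my_table")                      # placeholder table
        .option("user", "user")                             # placeholder creds
        .option("password", "password")
        .mode("append")
        .save())

# Hooked up to an assumed streaming DataFrame `parsed_df`:
# query = (parsed_df.writeStream
#     .foreachBatch(write_batch_to_mysql)
#     .option("checkpointLocation", "/tmp/ckpt")
#     .start())
# query.awaitTermination()
```

An empty target table often comes down to the query never actually running: a missing `.start()`, or the driver exiting because `awaitTermination()` was never called.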
1 vote · 1 answer
Unable to send Pyspark data frame to Kafka topic
I am trying to send data from a daily batch to a Kafka topic using pyspark, but I currently receive the following error:
Traceback (most recent call last): File "", line 5, in
…

TokyoMike · 798
1 vote · 1 answer
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start() kafka
I want to pipe a Python machine learning file, predict the output, then attach it to my DataFrame and save it.
The error that I am getting is:
Exception Details
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with…

Niraj kumar · 11
1 vote · 0 answers
Mask the data coming from Kafka stream
I am using Spark Structured Streaming to stream data from Kafka, which gives me a DataFrame with the schema below:
Column         Type
key            binary
value          binary
topic          string
partition      int
offset         long
timestamp      long
timestampType  …

Harshit · 560
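One common way to mask sensitive fields after deserializing the Kafka `value` column is to hash them. The helper below is plain Python so the idea is easy to test; in PySpark the same thing is usually done with the built-in `sha2` function (or a UDF wrapping a function like this one).

```python
import hashlib

def mask(value: str) -> str:
    # One-way mask: replace the sensitive string with its SHA-256 hex digest.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# PySpark equivalent without a UDF (assumes a string-typed column "value"):
# from pyspark.sql import functions as F
# df = df.withColumn("value_masked", F.sha2(F.col("value").cast("string"), 256))
```

Hashing preserves joinability (equal inputs mask to equal outputs) but is irreversible; if the values must be recoverable, encryption rather than hashing is needed.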
1 vote · 1 answer
Debug Kafka pipeline by reading same topic with two different spark structured streams
I have a Kafka topic streaming data in my production. I want to use the same data stream for debugging without impacting the offsets of the existing pipeline.
I remember creating different consumer groups for this purpose in…

Harshit · 560
1 vote · 1 answer
Consuming from Kafka using Kafka methods and Spark Streaming gives different results
I am trying to consume some data from Kafka using Spark Streaming.
I have created two jobs:
A simple Kafka job that uses
consumeFirstStringMessageFrom(topic)
which gives the expected topic values.
{
"data": {
"type": "SA_LIST",
"login":…

Driss NEJJAR · 872
1 vote · 1 answer
Is there any limit to the number of records that can be produced to a Kafka topic in a single produce command
I have a Databricks Kafka producer that needs to write 62M records to a Kafka topic. Will there be an issue if I write 62M records at the same time? Or do I need to iterate, say, 20 times and write ~3M records per iteration?
Here is the code.
Cmd1 val…

Don Sam · 525
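Broadly speaking, there is no fixed row-count cap on a single Kafka write from Spark: the write is distributed across executors, and throughput is governed by producer settings (e.g. `batch.size`, `linger.ms`) and broker-side quotas rather than by record count. If you still want to send the data in waves, the iteration logic is just client-side chunking, sketched here in plain Python with an illustrative chunk size:

```python
def chunks(seq, size):
    # Yield consecutive slices of `seq` of at most `size` elements,
    # e.g. 62M records in ~3M-record waves with size=3_000_000.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]
```

In Spark itself the equivalent would be filtering or repartitioning the DataFrame per wave and writing each slice with the Kafka sink in turn.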
1 vote · 1 answer
How to suppress stdout 'batch' output when streaming with Spark?
How can I change or totally suppress this batch metadata and only show my own output?
-------------------------------------------
Batch:…

ERJAN · 23,696
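The `------- Batch: N -------` banner comes from the `console` sink itself. One way to keep per-batch output while dropping the banner is to replace `format("console")` with a `foreachBatch` callback and print only what you want; a minimal sketch (the streaming DataFrame `df` is assumed):

```python
def print_batch(batch_df, batch_id):
    # batch_df is an ordinary DataFrame here; print rows without the
    # console sink's "Batch: N" banner. Fine for debugging, but collect()
    # pulls the whole micro-batch to the driver.
    for row in batch_df.collect():
        print(row)

# query = df.writeStream.foreachBatch(print_batch).start()
```

Alternatively, leaving the console sink in place and raising the log level only silences log4j output, not this banner, which is why `foreachBatch` is the usual workaround.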
1 vote · 1 answer
Running multiple Spark Kafka Structured Streaming queries in the same Spark session increases the offset but shows numInputRows 0
I have a Spark Structured Streaming job consuming records from a Kafka topic with two partitions.
Spark job: two queries, each consuming from a separate partition, running in the same Spark session.
val df1 = session.readStream.format("kafka")
…

Amit Joshi · 172
1 vote · 1 answer
Spark Structured Streaming won't pull the final batch from Kafka
I noticed that Spark Structured Streaming won't process a waiting batch if there are no batches after it. What I saw is that Spark always leaves one batch on Kafka waiting to be consumed when it is writing Parquet to HDFS.
This is to do with the way Spark cleans up…

PHenry · 157
1 vote · 2 answers
Trying to consume Kafka streams using Spark Structured Streaming
I'm new to Kafka streaming. I set up a Twitter listener using Python, and it is running against the localhost:9092 Kafka server. I could consume the stream produced by the listener using a Kafka client tool (Conduktor) and also using the command…

Sri-nidhi · 25
1 vote · 1 answer
PySpark : java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.kafka010.KafkaDataConsumer$
I am trying to fetch messages from a Kafka topic and print them to the console. I am able to fetch the messages through the reader successfully, but when I try to print them to the console through the writer, I get the below…

Jim Macaulay · 4,709
1 vote · 0 answers
Spark Streaming: Error Caused by: java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
I am running a Spark Streaming application that reads data from Kafka and dumps it into SQL, but I am getting errors in the application: it runs fine for a long period of time and then fails with the following error code.
log
Driver…

Nishad patil · 11
1 vote · 1 answer
How to distribute data evenly in Kafka producing messages through Spark?
I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) takes more data than the others.
+-----------------------------------------------------+
| partition | messages | earliest offset | next…

Omar · 47
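Skew like this usually comes from the record key: Kafka's default partitioner hashes the key, so a hot key (or a skewed key distribution) piles records into one partition, while records produced with a null key are spread round-robin. The plain-Python sketch below illustrates the two assignment strategies; the simple byte-sum hash stands in for Kafka's murmur2-based partitioner (the real function differs, but the skew mechanism is the same), and the partition counts are made up.

```python
def hash_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's key-hash partitioning: equal keys always land
    # on the same partition, so a hot key creates a hot partition.
    return sum(key) % num_partitions

def round_robin(counter: int, num_partitions: int) -> int:
    # Keyless (null-key) producing cycles through partitions evenly.
    return counter % num_partitions
```

From Spark, dropping the `key` column (or setting it to null) before writing to the Kafka sink is the usual way to get even round-robin distribution, at the cost of losing per-key ordering.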