Questions tagged [spark-streaming-kafka]

Spark Streaming integration for Kafka. The Direct Stream approach provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.

250 questions
1 vote, 1 answer

Spark Structured Streaming custom partition directory name

I'm porting a streaming job (Kafka topic -> AWS S3 Parquet Files) from Kafka Connect to Spark Structured Streaming Job. I partition my data by year/month/day. The code is very simple: df.withColumn("year",…
1 vote, 1 answer

Issue writing records into MySQL from a Spark Structured Streaming DataFrame

I am using the code below to write a Spark Streaming DataFrame into a MySQL DB. Below are the Kafka topic's JSON data format and the MySQL table schema. Column names and types match exactly, but I am unable to see records written to the MySQL table. The table is empty…
1 vote, 1 answer

Unable to send a PySpark DataFrame to a Kafka topic

I am trying to send data from a daily batch to a Kafka topic using PySpark, but I currently receive the following error: Traceback (most recent call last): File "", line 5, in …
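A frequent cause of this kind of failure is that Spark's Kafka sink requires each outgoing row to expose a string or binary column named `value` (and optionally `key`). As a Spark-free sketch of that preparation step, records can be serialized to JSON strings first; the function name and field names below are illustrative, not from the question:

```python
import json

def to_kafka_rows(records):
    """Serialize dict records into the {"value": <json string>} shape
    a Kafka sink expects. sort_keys makes the output deterministic."""
    return [{"value": json.dumps(r, sort_keys=True)} for r in records]

rows = to_kafka_rows([{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.0}])
print(rows[0]["value"])  # {"amount": 9.5, "id": 1}
```

In PySpark itself the usual equivalent is something like `df.selectExpr("to_json(struct(*)) AS value")` before writing with `format("kafka")`.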
1 vote, 1 answer

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start() kafka

I want to pipe a Python machine learning file, predict the output, attach it to my DataFrame, and then save it. The error that I am getting is: Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with…
1 vote, 0 answers

Mask the data coming from Kafka stream

I am using Spark Structured Streaming to stream data from Kafka, which gives me a DataFrame with the below schema: key binary, value binary, topic string, partition int, offset long, timestamp long, timestampType …
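Since the payload arrives in a binary `value` column, one common masking approach (a sketch, not the asker's code) is to hash sensitive fields deterministically, so equality joins still work while raw values stay hidden; SHA-256 is an assumed choice here:

```python
import hashlib

def mask(value: bytes) -> str:
    """Deterministically mask a sensitive value by hashing it.
    Same input -> same output, so masked columns remain joinable."""
    return hashlib.sha256(value).hexdigest()

masked = mask(b"alice@example.com")
print(masked[:8])
```

In Spark this function could be registered as a UDF and applied to the relevant fields after casting `value` to a string and parsing it.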
1 vote, 1 answer

Debug a Kafka pipeline by reading the same topic with two different Spark Structured Streams

I have a Kafka topic streaming data in my production environment. I want to use the same data stream for debugging without impacting the offsets of the existing pipeline. I remember creating different consumer groups for this purpose in…
1 vote, 1 answer

Consuming from Kafka using Kafka methods and Spark Streaming gives different results

I am trying to consume some data from Kafka using Spark Streaming. I have created two jobs: a simple Kafka job that uses consumeFirstStringMessageFrom(topic), which gives the expected values from the topic. { "data": { "type": "SA_LIST", "login":…
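One common source of mismatched results here is that Spark's Kafka source returns `value` as raw bytes, while helper methods like `consumeFirstStringMessageFrom` string-deserialize for you. A minimal, Spark-free sketch of the decoding step (function name is illustrative):

```python
import json

def decode_value(raw: bytes) -> dict:
    """Decode a Kafka message payload delivered as bytes: UTF-8
    decode, then JSON-parse, recovering the same record a plain
    string-deserializing consumer would see."""
    return json.loads(raw.decode("utf-8"))

record = decode_value(b'{"data": {"type": "SA_LIST"}}')
print(record["data"]["type"])  # SA_LIST
```

In Spark SQL terms, the analogous step is casting the `value` column to a string and applying `from_json` with the expected schema.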
1 vote, 1 answer

Is there any limit to the number of records that can be produced to a Kafka topic in a single produce command?

I have a Databricks Kafka producer that needs to write 62M records to a Kafka topic. Will there be an issue if I write 62M records at the same time, or do I need to iterate, say, 20 times and write 3M records per iteration? Here is the code. Cmd1 val…
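For context: the Kafka producer client already splits a large write into smaller network batches (governed by settings such as `batch.size` and `linger.ms`), so a single produce call is not sent to the broker as one 62M-record request. If explicit iteration is still wanted, a minimal sketch of the chunking arithmetic (the numbers mirror the question; nothing here is Kafka-specific):

```python
def chunks(total: int, chunk_size: int):
    """Yield (start, end) index ranges covering `total` records in
    batches of at most `chunk_size` records each."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)

# 62M records in 3M-record batches -> 21 iterations (the last is partial).
ranges = list(chunks(62_000_000, 3_000_000))
print(len(ranges))  # 21
```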
1 vote, 1 answer

How to suppress the stdout 'Batch' output when streaming with Spark?

How can I change or totally suppress this batch metadata and only show my own output? ------------------------------------------- Batch:…
1 vote, 1 answer

Running multiple Spark Kafka Structured Streaming queries in the same Spark session increases the offset but shows numInputRows 0

I have a Spark Structured Streaming job consuming records from a Kafka topic with 2 partitions. Spark job: 2 queries, each consuming from a separate partition, running in the same Spark session. val df1 = session.readStream.format("kafka") …
1 vote, 1 answer

Spark Structured Streaming won't pull the final batch from Kafka

I noticed that Spark Structured Streaming won't process a waiting batch if there are no batches after it. What I saw is that Spark must always leave one batch waiting on Kafka to be consumed when it is writing Parquet to HDFS. This has to do with the way Spark cleans up…
1 vote, 2 answers

Trying to consume a Kafka stream using Spark Structured Streaming

I'm new to Kafka streaming. I set up a Twitter listener using Python, and it is producing to the Kafka server on localhost:9092. I could consume the stream produced by the listener using a Kafka client tool (Conduktor) and also using the command…
1 vote, 1 answer

PySpark : java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.kafka010.KafkaDataConsumer$

I am trying to fetch messages from a Kafka topic and print them to the console. I am able to fetch the messages through the reader successfully, but when I try to print them to the console through the writer, I get the below…
1 vote, 0 answers

spark streaming:- Error Caused by: java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

I am running a Spark Streaming application that reads data from Kafka and dumps it into SQL, but I am getting errors in the application. It runs fine for a long period of time and then fails with the following error code. log Driver…
1 vote, 1 answer

How to distribute data evenly across Kafka partitions when producing messages through Spark?

I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) takes more data than the others. +-----------------------------------------------------+ | partition | messages | earliest offset | next…
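Skew like this usually traces back to message keys: Kafka's default partitioner hashes the key (murmur2) to pick a partition, so a hot key concentrates traffic on one partition, while distinct keys (or a null key, which lets the producer round-robin/sticky-assign) spread the load. A small simulation; CRC32 stands in for murmur2 here, and the partition count is an assumption:

```python
from collections import Counter
import zlib

NUM_PARTITIONS = 4

def partition_for(key: bytes) -> int:
    """Illustrative key -> partition mapping. Kafka's default
    partitioner actually uses murmur2, not CRC32."""
    return zlib.crc32(key) % NUM_PARTITIONS

# A single hot key piles every message onto one partition...
skewed = Counter(partition_for(b"hot-key") for _ in range(1000))

# ...while 1000 distinct keys spread across all partitions.
spread = Counter(partition_for(str(i).encode()) for i in range(1000))

print(len(skewed))  # 1 -- every message landed on the same partition
```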