Questions tagged [spark-kafka-integration]

Use this tag for any Spark-Kafka integration. It applies to both batch and stream processing, and covers both Spark Streaming (DStreams) and Structured Streaming.

This tag is related to the spark-streaming-kafka and spark-sql-kafka libraries.
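
As a rough illustration of how the two libraries are usually referenced, here is a minimal sketch that builds the Maven coordinate typically passed to `spark-submit --packages` or `spark.jars.packages`. The helper name `kafka_package` and the API labels `"dstreams"` and `"structured"` are hypothetical, and the `0-10` artifact suffix and the example versions (Scala 2.12, Spark 3.1.2) are assumptions; check them against your own Spark build.

```python
# Artifact names of the two Kafka connectors (assumed: the 0-10 variants).
_ARTIFACTS = {
    "dstreams": "spark-streaming-kafka-0-10",   # Spark Streaming (DStreams)
    "structured": "spark-sql-kafka-0-10",       # Structured Streaming and batch
}


def kafka_package(api: str, scala_version: str, spark_version: str) -> str:
    """Return a 'group:artifact_scala:version' coordinate for the connector.

    `api` is a hypothetical label: "dstreams" or "structured".
    """
    try:
        artifact = _ARTIFACTS[api]
    except KeyError:
        raise ValueError(f"unknown api {api!r}; expected one of {sorted(_ARTIFACTS)}") from None
    # The Scala binary version is part of the artifact id; the connector
    # version is normally kept equal to the Spark version in use.
    return f"org.apache.spark:{artifact}_{scala_version}:{spark_version}"


if __name__ == "__main__":
    # Example values only; substitute the versions of your own installation.
    print(kafka_package("structured", "2.12", "3.1.2"))
```

The resulting string could then be supplied as, for example, `spark-submit --packages <coordinate> app.py`. A mismatch between the Scala or Spark version in the coordinate and the running cluster is a common source of class-loading errors.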

External sources:

To make your question more precise, consider adding

This tag serves as a synonym for the existing (low traffic) tag which only focuses on Spark Streaming (not batch and not Structured Streaming).

96 questions
3
votes
2 answers

Kafka Spark Structured Streaming with SASL_SSL authentication

I have been trying to use the Spark Structured Streaming API to connect to a Kafka cluster with SASL_SSL. I have passed the jaas.conf file to the executors. It seems I couldn't set the values of keystore and truststore authentications. I tried passing the…
3
votes
2 answers

What is the "offset was changed from X to 0" error with a KafkaSource in Spark Structured Streaming?

I'm getting the error "offset was changed from X to 0, some data may have been missed" with a KafkaSource in a Spark Structured Streaming application with checkpointing but it doesn't seem to actually cause any problem. I'm trying to figure out what…
2
votes
1 answer

I have more data in a Kafka topic, but when I extract data using my PySpark application, I am getting only 1 row extracted. How to fix?

I have more data in a Kafka topic, but when I extract data using my PySpark application (which I use to extract from different Kafka topics), I am getting only 1 row extracted. Previously I had extracted data from the same topic using the same…
rakk
2
votes
1 answer

Kafka and pyspark program: Unable to determine why dataframe is empty

Below is my first program working with Kafka and PySpark. The code seems to run without exceptions, but the output of my query is empty. I am initiating Spark and Kafka. Later, in Kafka initiation, I subscribed to the topic = "quickstart-events" and…
2
votes
1 answer

kafka integration with Pyspark structured streaming (Windows)

After installing Anaconda on my Windows 10 machine, I followed this tutorial to set it up on my machine and run it with Jupyter: https://changhsinlee.com/install-pyspark-windows-jupyter/ The Spark version is 3.1.2 and Python is 3.8.8, so…
2
votes
1 answer

How to get at least N number of logs from Kafka through Spark?

In Spark Streaming, I am getting logs as they arrive. But I want to get at least N logs in a single pass. How can it be achieved? From this answer, it appears there is such a utility in Kafka, but it doesn't seem to be present in Spark to make…
2
votes
1 answer

How to subscribe to a new topic with subscribePattern?

I am using Spark Structured Streaming with Kafka, and the topic is subscribed as a pattern: option("subscribePattern", "topic.*") // Subscribe to a pattern val df = spark .readStream .format("kafka") .option("kafka.bootstrap.servers",…
2
votes
1 answer

PySpark : Writing Kafka Topic to Console is Failing

I am getting messages from a Kafka topic and writing them to the console. Reading the messages is not an issue; I am able to read them and also print the schema. But when I try to write them to the console, it fails. Any suggestion would be…
1
vote
1 answer

Difference between spark-streaming-kafka-0-10 vs spark-sql-kafka-0-10

I am hoping to read a parquet file and write to Kafka import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions.struct import org.apache.spark.sql.functions.to_json object IngestFromS3ToKafka { def main(args: Array[String]):…
Hongbo Miao
1
vote
0 answers

spark streaming from kafka on spark operator(Kubernetes)

I have a Spark Structured Streaming job in Scala, reading from Kafka and writing to S3 as Hudi tables. Now I am trying to move this job to the Spark operator on EKS. When I give this option in the YAML file: spark.jars.packages:…
1
vote
1 answer

Error while pulling kafka jks certificates from hdfs (trying with s3 as well) in spark

I am running Spark in cluster mode, which gives the following error: ERROR SslEngineBuilder: Modification time of key store could not be obtained: hdfs://ip:port/user/hadoop/jks/kafka.client.truststore.jks java.nio.file.NoSuchFileException:…
1
vote
0 answers

problem with spark structured streaming after restart

I have simple PySpark code which reads data from Kafka and writes aggregated records to Oracle with foreachBatch. I have set checkpointLocation to an HDFS directory and it works well. When I kill the application and start the code again without any change, it gives the…
1
vote
0 answers

How to get commitedOffsets and availableOffsets from sparkstreaming

22/11/09 11:08:40 INFO MicroBatchExecution: Resuming at batch 206 with committed offsets {KafkaV2[Subscribe[test]]:…
1
vote
0 answers

Inferring a schema of a kafka topic taking too much time in databricks

I'm trying to determine the schema of a JSON Kafka topic. To achieve that, I took a piece of code from this blog (https://medium.com/wehkamp-techblog/streaming-kafka-topic-to-delta-table-s3-with-spark-structured-streaming-2bb3027c7565). Background: I…
1
vote
2 answers

PySpark Structured Streaming with Kafka - Scaling Consumers for multiple topics with different loads

We subscribed to 7 topics with spark.readStream in a single running Spark app. After transforming the event payloads, we save them with spark.writeStream to our database. For one of the topics, the data is inserted only batch-wise (once a day) with…