Spark Streaming integration for Kafka. The Direct Stream approach provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
Questions tagged [spark-streaming-kafka]
250 questions
1 vote · 1 answer
How to get the total count of records from a Kafka topic and save it into HDFS?
All,
I am working on consuming data from Kafka and dumping it into HDFS. I am able to consume the data and want to get the total count of records from Kafka and save it as a file into HDFS, so that I can use that file for validation. I was able to print…
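A minimal sketch of one way to do this, assuming Spark 2.x: read the topic as a batch (so the count is a single well-defined number) and write the result to HDFS. The broker address, topic name, and output path are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-count-to-hdfs").getOrCreate()
import spark.implicits._

// Batch-read the whole topic (earliest to latest) instead of streaming it,
// so count() returns one well-defined total.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "my_topic")                  // placeholder
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()

val total = df.count()

// Write the count as a one-line text file for the validation step.
Seq(total.toString).toDS.coalesce(1)
  .write.mode("overwrite").text("hdfs:///validation/kafka_count")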

Rab · 159 · 1 · 11
1 vote · 0 answers
Formatting structured Kafka stream in pyspark using named regex
I'm trying to extract multiple column values from an existing column in a streaming pyspark dataframe.
I read the stream using
stream_dataframe = spark_session.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers",…

Mister Spurious · 271 · 2 · 11
1 vote · 1 answer
Do skipped stages have any performance impact on a Spark job?
I am running a Spark Structured Streaming job which involves creating an empty dataframe and updating it with each micro-batch, as below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I am persisting the…
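For reference, a common shape of this pattern, sketched on the assumption that the accumulation happens in foreachBatch (Spark 2.4+): persisting each new union and unpersisting the previous one keeps the growing lineage from being replayed every micro-batch. The schema, topic, and brokers are placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("accumulate").getOrCreate()
import spark.implicits._

// Accumulator dataframe updated by every micro-batch (schema is illustrative).
var accumulated: DataFrame = Seq.empty[(String, Long)].toDF("key", "value")

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "my_topic")                  // placeholder
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "offset AS value")

val query = stream.writeStream.foreachBatch { (batch: DataFrame, _: Long) =>
  val previous = accumulated
  accumulated = previous.union(batch).persist() // cache the new state
  accumulated.count()                           // materialize the cache
  previous.unpersist()                          // drop the stale copy
}.start()

query.awaitTermination()

Skipped stages themselves are benign: they mean a shuffle output was reused rather than recomputed. What grows without the persist-and-swap is the plan and lineage each batch has to carry.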

conetfun · 1,605 · 4 · 17 · 38
1 vote · 1 answer
KafkaUtils class in kafka 0.11
We are using Spark Streaming to read from and write to Kafka, and use the KafkaUtils library in spark-streaming_2.11, which has the Kafka 0.10.0 libs. Right now I am in the process of upgrading the kafka-client jars to 0.11 to use some feature, but since…
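The mechanics of the swap can be sketched in sbt: keep the Spark integration module and pin the newer client on the classpath. Whether the 0.10 integration actually runs against 0.11.x clients has to be verified; the versions below are assumptions for illustration.

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.2.0" % Provided,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
  "org.apache.kafka" % "kafka-clients" % "0.11.0.2"
)

// Pin the transitive kafka-clients 0.10.0 dependency to the newer version.
dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "0.11.0.2"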

Ajith Kannan · 812 · 1 · 8 · 30
1 vote · 1 answer
kafka direct stream and seekToEnd
In my Spark job I initialize the Kafka stream with KafkaUtils.createDirectStream.
I have read about the seekToEnd method on the Consumer. How can I apply it to the stream?
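There is no seekToEnd hook exposed on the DStream itself; the closest equivalent is letting the consumer start from the log end by setting auto.offset.reset to latest for a group with no committed offsets (explicit per-partition offsets can also be passed to ConsumerStrategies.Subscribe). A sketch, with brokers, topic, and group id as placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(new SparkConf().setAppName("from-log-end"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",              // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",                          // placeholder
  // With no committed offsets for the group, the stream begins at the log
  // end, which is effectively seekToEnd for the direct stream.
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my_topic"), kafkaParams))

stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))
ssc.start()
ssc.awaitTermination()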

Dorg · 11 · 2
1 vote · 1 answer
Fault tolerance for Kafka Direct Stream does not work: checkpoint directory does not exist
I wrote an app that reads data from a Kafka topic, and I can't achieve fault tolerance in the event of a driver failure. The application runs in a k8s cluster using spark-submit. When I run my application for the first time, everything goes well, but when…
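The usual pattern for surviving driver failure is StreamingContext.getOrCreate with a checkpoint directory on storage that outlives the driver pod (HDFS, S3, or a persistent volume); a path on the pod's local disk vanishes with the pod, which is one way to end up with "checkpoint directory does not exist". A sketch, with the checkpoint path as a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Must live on storage that survives a driver-pod restart.
val checkpointDir = "hdfs:///checkpoints/my-streaming-app" // placeholder

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("fault-tolerant"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Build the Kafka direct stream and all transformations here, inside the
  // factory, so they can be reconstructed from the checkpoint on restart.
  ssc
}

// First run: calls createContext(). After a driver failure: rebuilds the
// context, DAG, and offsets from the checkpoint directory.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()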

Дмитрий Киселёв · 11 · 5
1 vote · 0 answers
Could not initialize class with mapGroupsWithState
I'm trying to create a Spark Structured Streaming application with arbitrary state. When I add groupByKey and mapGroupsWithState, it gives me an error after starting the first task.
.groupByKey(_.user_id)
…
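"Could not initialize class" usually points at a static-initializer or executor-classpath problem rather than the API itself, but for reference, here is the general shape of groupByKey plus mapGroupsWithState in Scala. The case classes, parsing, and update logic are illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Illustrative event and state types.
case class Event(user_id: String, amount: Double)
case class UserState(user_id: String, total: Double)

val spark = SparkSession.builder().appName("arbitrary-state").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "my_topic")                  // placeholder
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]
  .map(line => Event(line, 1.0)) // placeholder parsing

// Fold each user's events into a running total held in GroupState.
def updateState(userId: String, batch: Iterator[Event],
                state: GroupState[UserState]): UserState = {
  val previous = state.getOption.getOrElse(UserState(userId, 0.0))
  val updated = previous.copy(total = previous.total + batch.map(_.amount).sum)
  state.update(updated)
  updated
}

val stateful = events
  .groupByKey(_.user_id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateState)

stateful.writeStream
  .outputMode("update") // mapGroupsWithState pairs with update mode here
  .format("console")
  .start()
  .awaitTermination()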

M. Alexandru · 614 · 5 · 20
1 vote · 0 answers
How to use JSON data instead of a JSON path in Spark Structured Streaming
I have data in a variable jsondata as below:
[{'sno': 1, 'number': '000-00-00000'}]
How can I use this data with json() during Structured Streaming in Spark, which actually expects a path that I don't have?
I have tried the code below, but it threw an…
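When the JSON lives in a variable there is nothing to point a path at; one way around it is to wrap the string in a Dataset and use spark.read.json, and the streaming analogue is from_json on a string column with an explicit schema. A sketch following the question's sample data (the single quotes become double quotes to make it valid JSON):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructType}

val spark = SparkSession.builder().appName("json-variable").getOrCreate()
import spark.implicits._

val jsondata = """[{"sno": 1, "number": "000-00-00000"}]"""

// Batch: read JSON from an in-memory Dataset[String] instead of a path.
spark.read.json(Seq(jsondata).toDS).show()

// Streaming analogue: parse a string column with from_json; streams cannot
// infer a schema from data, so it must be given explicitly.
val schema = ArrayType(new StructType()
  .add("sno", LongType)
  .add("number", StringType))
Seq(jsondata).toDF("value")
  .select(from_json($"value", schema).as("rows"))
  .show(false)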

Srinathji Kyadari · 221 · 5 · 15
1 vote · 0 answers
How to add Spark dependencies from inside pyspark shell
I'm working with Spark 2.3.1, Kafka 1.1 and Python 2.7.9, and I'm not able to upgrade them.
I've run into a problem when trying to use Spark Streaming to push data to (or pull data from) Kafka. When I have no data in the Kafka queue (because it ends or…
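Dependencies generally cannot be attached once the shell's JVM is up; the workaround is to relaunch pyspark with the integration declared up front via --packages. The exact coordinates below are an assumption (the 0.8 DStream integration that PySpark still supported at 2.3.1):

# Relaunch the shell with the Kafka integration on the driver and executors.
pyspark --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1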

Krakenudo · 182 · 1 · 17
1 vote · 2 answers
Kafka createStream running but not printing the processed output from the Kafka topic in PySpark
I am using Kafka 2.0, Spark 2.2.0.2.6.4.0-91, and Python 2.7.5.
I am running the code below and it streams without any error, but the count is not printed in the output.
import sys
from pyspark import SparkContext
from…
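Two things have to be true for a count to appear with this API: an output operation (count().print()) registered before the context starts, and ssc.start() actually called. A minimal Scala skeleton illustrating the ordering, using queueStream as a stand-in source so the example is self-contained:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val ssc = new StreamingContext(new SparkConf().setAppName("print-check"), Seconds(1))

// Stand-in for the Kafka stream, just to make the skeleton runnable.
val queue = mutable.Queue(ssc.sparkContext.makeRDD(Seq("a", "b", "c")))
val stream = ssc.queueStream(queue)

stream.count().print() // must be registered before start()

ssc.start()            // without this, no batch runs and nothing prints
ssc.awaitTermination()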

Afzal Abdul Azeez · 195 · 1 · 13
1 vote · 0 answers
Commit offsets after processing multiple streams in spark streaming
I have a usecase where we create multiple streams from the Kafka DStream. I would like to commit the offsets only after processing both the streams successfully. Is this possible?
Current strategy:
1) create dstream one.
2) create dstream two.
3)…
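Offsets can only be committed through the original input stream (the object implementing CanCommitOffsets), so one approach is to capture the batch's offset ranges first, run both pipelines, and commit once at the end. A sketch, where the two process functions are hypothetical stand-ins:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(new SparkConf().setAppName("two-pipelines"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092", // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "two-pipelines-group",  // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean))

val inputStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my_topic"), kafkaParams))

// Hypothetical stand-ins for the two derived pipelines.
def processStreamOne(rdd: RDD[ConsumerRecord[String, String]]): Unit = rdd.foreach(_ => ())
def processStreamTwo(rdd: RDD[ConsumerRecord[String, String]]): Unit = rdd.foreach(_ => ())

inputStream.foreachRDD { rdd =>
  // Offset ranges are only available on the original, untransformed Kafka RDD.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  processStreamOne(rdd)
  processStreamTwo(rdd)

  // Commit once, only after both pipelines succeeded; an exception above
  // skips the commit, so the batch replays on restart (keep processing idempotent).
  inputStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()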

wandermonk · 6,856 · 6 · 43 · 93
1 vote · 0 answers
Spark Streaming dynamic executors override Kafka parameters in cluster mode
I have written a Spark Streaming consumer to consume the data from Kafka, and I found some weird behavior in my logs. The Kafka topic has 3 partitions and, for each partition, an executor is launched by the Spark Streaming job.
The first executor id always…

wandermonk · 6,856 · 6 · 43 · 93
1 vote · 0 answers
How to calculate the sum of values using PySpark with Kafka and Spark Streaming
Currently I receive 4 or more vehicle IoT sensor data records every second, and for simplicity's sake I would like to start by adding the 4 velocity readings. Most of the code examples I have found provide counts, which I can…
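Replacing the usual count with a sum is mostly a matter of casting the payload to a number and aggregating. A structured-streaming sketch in Scala, assuming each message value is a bare numeric velocity (real payloads would need from_json first); topic and brokers are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("velocity-sum").getOrCreate()
import spark.implicits._

val velocities = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "vehicle_telemetry")         // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS v")
  .select($"v".cast("double").as("velocity"))

// Running total of all velocities seen so far; "complete" mode reprints
// the whole (single-row) result every micro-batch.
velocities.agg(sum($"velocity").as("velocity_sum"))
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()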

Raanan Ikar · 57 · 7
1 vote · 0 answers
Correct way to store offsets in Kafka when using Spark and Elasticsearch
I have done a lot of research on this, but I am still not able to find something suitable. Everywhere I go, I see that the easiest way is to call saveToEs() and then commit offsets after that. My question is: what if saveToEs() fails for some reason?…
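Committing after the write only gives at-least-once, so the usual companion is making the Elasticsearch write idempotent: key every document with a stable id (es.mapping.id) so a replayed batch overwrites rather than duplicates. A sketch against elasticsearch-hadoop's EsSpark, assuming a direct stream set up as in the earlier sketches; the index name and id scheme are illustrative:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
import org.elasticsearch.spark.rdd.EsSpark

// `stream` comes from KafkaUtils.createDirectStream, as shown above.
def writeThenCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // topic-partition-offset is unique per record, so retries overwrite
    // the same documents instead of duplicating them.
    val docs = rdd.map(r => Map(
      "id"      -> s"${r.topic}-${r.partition}-${r.offset}",
      "payload" -> r.value))
    EsSpark.saveToEs(docs, "events/doc", Map("es.mapping.id" -> "id"))

    // Reached only if saveToEs did not throw; on failure the offsets stay
    // uncommitted and the batch is re-delivered after restart.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }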

alina · 291 · 2 · 9
1 vote · 0 answers
Spark job fails once a day with java.io.OptionalDataException
I am using Spark 2.2.0 and running my jobs using YARN on Cloudera. It's a streaming job which takes events from Kafka, filters and enriches them, stores them in ES, and then commits offsets back to Kafka. These are the…

alina · 291 · 2 · 9