Spark Streaming integration for Kafka. The Direct Stream approach provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
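For reference, a minimal Direct Stream setup with the spark-streaming-kafka-0-10 artifact looks roughly like this (broker address, group id, and topic name are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream-example"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",            // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                      // placeholder group id
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// One Spark partition per Kafka partition; each record carries topic/partition/offset
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("example-topic"), kafkaParams))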
Questions tagged [spark-streaming-kafka]
250 questions
0
votes
0 answers
Distinct operation in Spark Structured Streaming with a window operation
I want to implement a distinct operation in Spark Structured Streaming code.
I have already watermarked and windowed the stream, but Spark is still not able to execute it.
FYI - distinct comes under the list of unsupported operations in Spark Streaming, but I…
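For context, Structured Streaming does support streaming deduplication through dropDuplicates combined with a watermark. A minimal sketch, assuming a stream named events with id and eventTime columns:

val deduped = events
  .withWatermark("eventTime", "10 minutes") // bound the state kept for deduplication
  .dropDuplicates("id", "eventTime")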

Deval
- 97
- 7
0
votes
1 answer
Not able to read data through Kafka Spark Streaming in PySpark
I am working on creating a basic streaming app which reads streaming data from Kafka and processes the data. Below is the code I am trying in PySpark
spark = SparkSession.builder.appName("testing").getOrCreate()
df = spark \
.readStream…
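A complete minimal version of this read (in Scala here; the PySpark options are identical) needs at least the kafka.bootstrap.servers and subscribe options, plus the spark-sql-kafka-0-10 package on the classpath; broker and topic names below are placeholders:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "test-topic")                   // placeholder topic
  .option("startingOffsets", "earliest")
  .load()

// key/value arrive as binary; cast them before use
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()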

Nikhil
- 101
- 2
- 13
0
votes
1 answer
Spark Structured Streaming job not processing stages and appearing to hang
I am running a streaming application that processes data from Kafka to Kafka using Spark.
If I use the latest offsets, it works as expected and runs without any issue,
but in the source we have done a bulk transaction (200,000 records) and when using earliest…
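A common mitigation when starting from earliest against a large backlog is to cap each micro-batch with maxOffsetsPerTrigger; a sketch with placeholder names:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "input-topic")                  // placeholder topic
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", "10000")             // at most 10,000 records per micro-batch
  .load()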

Sonu
- 77
- 11
0
votes
1 answer
Offset management in Spark Streaming
As far as I understand, for a Spark streaming application (Structured Streaming or otherwise), to manually manage the offsets, Spark provides the feature of checkpointing, where you just have to configure the checkpoint location (HDFS most of the time)…
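In Structured Streaming that looks like the sketch below: offsets and state are tracked under the configured checkpoint location, and the query resumes from them on restart (broker, topic, and path are placeholders):

val query = df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")        // placeholder broker
  .option("topic", "output-topic")                            // placeholder topic
  .option("checkpointLocation", "hdfs:///checkpoints/my-app") // offsets + state live here
  .start()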

Gaurav Gupta
- 159
- 1
- 17
0
votes
0 answers
Problem integrating Kafka and Spark Streaming: no messages received in Spark Streaming
My Spark streaming context successfully subscribes to my Kafka topic, where my tweets are streamed by my Twitter producer, but no messages are being streamed from the topic in my Spark Streaming application!
Here is my code
def main(args: Array[String]){
val…
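One frequent cause of a silent DStream job is never starting the context. Assuming a stream built as in the Direct Stream sketch at the top of this page, the tail of main should look like:

stream.map(_.value).print() // an output action, so the stream has work to do

ssc.start()                 // nothing is consumed until start() is called
ssc.awaitTermination()      // keep the application alive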

bigdata1800
- 11
- 1
0
votes
0 answers
spark-streaming-kafka-0-10 does not support message handler
My use case is to print the offset, partition, and topic for each record that has been read from Kafka in a Spark Streaming application.
Currently my code to create the DStream looks like this.
val stream: InputDStream[ConsumerRecord[String,…
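In the 0-10 API, the metadata the old message handler exposed is available directly on each ConsumerRecord, so the use case can be sketched as:

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // topic, partition, and offset come straight off the ConsumerRecord
    println(s"topic=${record.topic} partition=${record.partition} " +
      s"offset=${record.offset} value=${record.value}")
  }
}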

amarnath harish
- 945
- 7
- 24
0
votes
0 answers
How to use environment variables in Spark when deployed in cluster mode?
When I set environment variables using IntelliJ, the code below works, but when I deploy the code with spark-submit it does not work, since the environment variables do not exist across the cluster.
import com.hepsiburada.util.KafkaUtil
import…
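One common workaround is to pass values through Spark configuration rather than the shell environment, e.g. spark-submit --conf spark.executorEnv.KAFKA_BROKERS=broker:9092 (the variable name here is a placeholder); a sketch of reading it with a fallback:

// Prefer the OS environment, fall back to Spark conf, then to a default
val brokers = sys.env.getOrElse("KAFKA_BROKERS",
  spark.sparkContext.getConf.get("spark.executorEnv.KAFKA_BROKERS", "localhost:9092"))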

Enes Uğuroğlu
- 377
- 5
- 16
0
votes
1 answer
Spark Structured Streaming reprocessing already processed records on failure
I am stuck on a very weird issue in Spark Structured Streaming: whenever I shut down the stream and restart it, it processes already processed records again.
I tried to use spark.conf.set("spark.streaming.stopGracefullyOnShutdown", True) but…
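Note that spark.streaming.stopGracefullyOnShutdown applies to the DStream API, not Structured Streaming, where replay of the last in-flight micro-batch after a restart is expected unless the sink is idempotent. A sketch of one idempotent pattern, with placeholder paths:

df.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // Overwriting a batchId-keyed path makes a replayed batch a no-op
    batch.write.mode("overwrite").parquet(s"/data/out/batch=$batchId")
  }
  .option("checkpointLocation", "/checkpoints/app") // placeholder path
  .start()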

Deepak
- 31
- 3
0
votes
0 answers
Spark Streaming provides 2 kinds of streams when integrating with Kafka: 1) Receiver-based 2) Direct. Which kind does Structured Streaming use?
Spark Streaming provides 2 kinds of streams when integrating with Kafka:
Receiver-based
Direct
Which kind of stream does Structured Streaming use when we do spark.readStream.format("kafka")?

Abhinav Kumar
- 210
- 3
- 13
0
votes
1 answer
Spark Structured Streaming from Kafka to Elasticsearch
I want to write a Spark Streaming job from Kafka to Elasticsearch, and I want to detect the schema dynamically while reading it from Kafka.
Can you help me do that?
I know this can be done in Spark batch processing via the line below.
val schema =…
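One workable pattern (assuming JSON-encoded values; broker and topic names are placeholders) is to infer the schema from a one-off batch read of the topic, then apply it to the stream with from_json:

import org.apache.spark.sql.functions.from_json
import spark.implicits._

// One-off batch read to sample the topic and infer a schema
val sampleJson = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]
val schema = spark.read.json(sampleJson).schema

// Apply the inferred schema to the streaming read
val parsed = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")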

Siva Samraj
- 37
- 1
- 5
0
votes
1 answer
Available options for a source/destination format of Spark Structured Streaming
When we use the DataStreamReader API for a format in Spark, we specify options for that format using the option/options methods. For example, in the code below, I'm using Kafka as the source and passing the configuration required for the source through…

Scarface
- 359
- 2
- 13
0
votes
1 answer
How do I get the data of one row of a Structured Streaming DataFrame in PySpark?
I have a Kafka broker with a topic connected to Spark Structured Streaming. My topic sends data to my streaming DataFrame, and I'd like to get information on each row of this topic (because I need to compare each row with another database).
If I…
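A streaming DataFrame cannot be iterated directly, but foreachBatch hands over each micro-batch as an ordinary DataFrame whose rows can be inspected (sketched in Scala; PySpark's foreachBatch is analogous, and streamingDf and the column name are placeholders):

streamingDf.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, _: Long) =>
    // collect() is fine for small batches; each row can now be compared
    // against the other database
    batch.collect().foreach { row =>
      val key = row.getAs[String]("key") // placeholder column
      // look up `key` in the other database and compare ...
    }
  }
  .start()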

Donsitoz
- 19
- 5
0
votes
1 answer
How to merge multiple datatypes of a union type in an Avro schema to show one data type in the value field instead of member0/member1
I have the following avro schema
{
  "name": "MyClass",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "data",
      "type": {
        "type": "map",
        "values": ["int", "string"]
      }
    }
…

Beluga
- 63
- 7
0
votes
0 answers
How to configure thread count on Spark Driver node?
We are running a Spark Streaming job in standalone cluster mode with deploy mode set to client. This streaming job periodically polls messages from a Kafka topic, and the logs generated at the driver node are flushed to a txt file.
After running…

Anoop Deshpande
- 514
- 1
- 6
- 23
0
votes
0 answers
Pivot stream data in Spark
I am reading data from a Kafka topic and I want to pivot the data.
I am using the code below in spark-shell
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = spark.readStream.format("kafka")
…
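pivot is not supported on a streaming DataFrame, so a common workaround is to pivot each micro-batch inside foreachBatch. A sketch, assuming data is the fully loaded and parsed stream from the snippet above; column names are placeholders, and spark-shell pre-imports the $ syntax:

import org.apache.spark.sql.functions.first

data.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, _: Long) =>
    batch.groupBy($"id")     // placeholder grouping column
      .pivot("category")     // placeholder pivot column
      .agg(first($"value"))  // placeholder aggregation
      .show()
  }
  .start()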

vishnupriya
- 55
- 9