Spark Streaming integration for Kafka. The Direct Stream approach provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
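As a quick illustration of the offset and metadata access the description mentions, here is a minimal sketch of the DStream-era direct API (Spark 2.x, `pyspark.streaming.kafka`); the broker address and topic name are placeholder assumptions, and this needs a running Kafka broker:

```python
# Sketch of the direct-stream API (Spark 2.x DStreams).
# "localhost:9092" and the topic "events" are illustrative assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="direct-stream-sketch")
ssc = StreamingContext(sc, batchDuration=10)

stream = KafkaUtils.createDirectStream(
    ssc, topics=["events"],
    kafkaParams={"metadata.broker.list": "localhost:9092"})

def print_offsets(rdd):
    # Each RDD partition maps 1:1 to a Kafka partition; offsetRanges()
    # exposes the (topic, partition, fromOffset, untilOffset) metadata.
    for o in rdd.offsetRanges():
        print(o.topic, o.partition, o.fromOffset, o.untilOffset)

stream.foreachRDD(print_offsets)
ssc.start()
ssc.awaitTermination()
```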
Questions tagged [spark-streaming-kafka]
250 questions
0 votes · 1 answer
Spark Structured Streaming: Output result at the end of the Tumbling Window and not the Batch
I want the output of the Spark stream to be sent to the sink at the end of the tumbling window and not at the batch interval.
I am reading from a Kafka stream and outputting to another Kafka stream.
The code to query and write output is like…

Nils · 806
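For context on the question above: in Structured Streaming, combining a watermark with the `append` output mode makes a window's aggregate emit only once, after the watermark passes the window end — i.e. at the close of the tumbling window rather than on every micro-batch. A hedged sketch (topic names, the 10-minute sizes, and the checkpoint path are illustrative assumptions):

```python
# Sketch: watermark + append output mode emits each window's result once,
# when the window closes. Topic names and durations are made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("tumbling-window-sketch").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input-topic")
          .load()
          .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window(col("timestamp"), "10 minutes"))
          .count())

query = (counts.selectExpr("CAST(window.start AS STRING) AS key",
                           "CAST(count AS STRING) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output-topic")
         .option("checkpointLocation", "/tmp/cp")  # required for the Kafka sink
         .outputMode("append")
         .start())
```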
0 votes · 2 answers
Spark Structured Streaming to read nested Kafka Connect jsonConverter message
I have ingested an XML file using the Kafka Connect file-pulse connector 1.5.3.
Then I want to read it with Spark Streaming to parse/flatten it, as it is quite nested.
The string I read out of Kafka (I used the console consumer to read this out, and put an…

soMuchToLearnAndShare · 977
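Background on why such messages look so nested: when Kafka Connect's JsonConverter runs with `schemas.enable=true`, each record is wrapped in a `{"schema": …, "payload": …}` envelope, and the actual record sits under `payload`. A plain-Python sketch of unwrapping and flattening (the field names in the sample are invented for illustration):

```python
import json

# Invented example of a Kafka Connect JsonConverter envelope
# (schemas.enable=true): the real record is under "payload".
raw = '''{
  "schema": {"type": "struct", "fields": [{"field": "id", "type": "int64"}]},
  "payload": {"id": 42, "source": {"file": "data.xml"}}
}'''

envelope = json.loads(raw)
payload = envelope["payload"]          # unwrap the Connect envelope

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {"source.file": ...}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

print(flatten(payload))   # -> {'id': 42, 'source.file': 'data.xml'}
```

In Spark itself the same unwrapping is typically done with `from_json` and a schema that mirrors the envelope, selecting `payload.*` afterwards.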
0 votes · 2 answers
Docker PySpark cluster container not receiving Kafka streaming from the host?
I have created and deployed a Spark cluster which consists of 4 containers running:
spark master
spark-worker
spark-submit
data-mount-container: to access the script from the local directory
I added the required dependency JARs to all of these…

GvrHari · 1
0 votes · 0 answers
Py4JJavaError: Job aborted due to stage failure: Task 0 in stage 460.0 failed 4 times
I am getting this weird error in my Spark Streaming code written in PySpark. I tried to debug the code but couldn't find the reason.
Below is my code. The name of the file is Script.py:
import os
from pyspark.sql.types import *
import json
from pyspark…

yahoo · 183
0 votes · 1 answer
What is the best way to structure a Spark Structured Streaming pipeline?
I'm moving data from my Postgres database to Kafka, doing some transformations with Spark in the middle.
I have 50 tables, and for each table the transformations are totally different from the others.
So, I want to know the best way to…

Luan Carvalho · 190
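One common shape for this kind of fan-out — offered as a sketch only, not as the accepted answer — is to keep one transformation function per table in a registry and start one streaming query per table from a single SparkSession. All names here (`transform_orders`, topic prefixes, paths) are hypothetical:

```python
# Hypothetical registry: one transform per table; two shown, the question
# would have 50. Topic naming and paths are invented for illustration.

def transform_orders(df):
    return df  # table-specific logic goes here

def transform_users(df):
    return df

TABLE_TRANSFORMS = {
    "orders": transform_orders,
    "users": transform_users,
    # ... one entry per table
}

def start_queries(spark):
    queries = []
    for table, transform in TABLE_TRANSFORMS.items():
        source = (spark.readStream.format("kafka")
                  .option("kafka.bootstrap.servers", "localhost:9092")
                  .option("subscribe", f"postgres.{table}")
                  .load())
        query = (transform(source)
                 .writeStream.format("kafka")
                 .option("kafka.bootstrap.servers", "localhost:9092")
                 .option("topic", f"transformed.{table}")
                 .option("checkpointLocation", f"/tmp/cp/{table}")
                 .start())
        queries.append(query)
    return queries
```

Each query gets its own checkpoint location, which is required for independent recovery.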
0 votes · 0 answers
Kafka with Spark Streaming works in local mode but doesn't work in Standalone mode
I'm trying to use Spark Streaming with a very simple script like this:
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc =…

Davide · 73
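A frequent cause of exactly this symptom (works with `master=local`, fails on a standalone cluster) is that the Kafka integration jar is only on the driver's classpath. One hedged guess at a fix is to let `spark-submit` ship the package to the executors; the coordinates below are an example for the 0-8 DStream integration and must match your own Spark/Scala build:

```shell
# Ship the DStream Kafka integration to the driver and executors.
# 2.11/2.4.5 are example versions - match your own Spark build.
spark-submit \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.5 \
  script.py
```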
0 votes · 1 answer
Too many KDC calls from KafkaConsumer in Spark Streaming
I have a standalone (master=local, for its own reasons) Spark Structured Streaming application that reads from a Kerberized Kafka cluster.
It works functionally, but it makes too many calls to the KDC to fetch a TGS for each micro-batch execution.
Either…

user3150983 · 79
0 votes · 2 answers
In Spark, unable to consume data from a Kafka topic
I'm new to Spark and Kafka, and I'm facing issues in Spark while trying to consume data from a Kafka topic. I'm getting the following error. Can somebody help me please?
In the SBT project I added all the dependencies:
build.sbt file
name :=…
0 votes · 1 answer
RecordTooLargeException in Spark *Structured* Streaming
I keep getting this error message:
The message is 1169350 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
As indicated in other Stack Overflow posts, I am trying to set…

DilTeam · 2,551
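For context on the error above: the Structured Streaming Kafka sink passes any option prefixed with `kafka.` through to the underlying producer, so the producer-side limit can be raised on the writer. A sketch (the 10 MB figure is an arbitrary example, and the broker's own `message.max.bytes` must also allow messages that large):

```python
# Sketch: raise the producer's max.request.size on the Kafka sink.
# Options prefixed with "kafka." are handed to the underlying producer.
query = (df.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output-topic")
         .option("kafka.max.request.size", str(10 * 1024 * 1024))  # 10 MB example
         .option("checkpointLocation", "/tmp/cp")
         .start())
```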
0 votes · 1 answer
Spark Structured Streaming reads data twice per micro-batch. How can this be avoided?
I have a very strange issue with Spark Structured Streaming: it creates two Spark jobs for every micro-batch and, as a result, reads the data from Kafka twice.
Here is a simple code snippet:
import org.apache.hadoop.fs.{FileSystem,…

Grigoriev Nick · 1,099
0 votes · 1 answer
I have a problem when working with readStream().format("kafka")
Please help me fix this error:
20/04/09 18:38:44 ERROR MicroBatchExecution: Query [id = 9f3cbbf6-85a8-4aed-89c6-f5d3ff9c40fa, runId = 73c071c6-e222-4760-a750-393666a298af] terminated with error
java.lang.ClassCastException:…

Vladimir · 1
0 votes · 1 answer
Fetch Kafka headers in Spark 2.4.x
How do I get Kafka header fields (introduced in Kafka 0.11+) in Spark Structured Streaming?
I see that the headers implementation was added in Spark 3.0 but not in 2.4.5,
and that by default spark-sql-kafka-0-10 uses kafka-client 2.0.
If it…

Kishorekumar Yakkala · 311
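For reference on the question above: on Spark 3.0+ the Kafka source exposes headers behind the `includeHeaders` option, and there is no equivalent in 2.4.x short of upgrading or reading Kafka directly with a consumer client. A minimal Spark 3.0 sketch (broker address and topic name assumed):

```python
# Spark 3.0+ only: includeHeaders adds a "headers" column
# (an array of key/value structs) to the Kafka source.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .option("includeHeaders", "true")
      .load()
      .selectExpr("CAST(value AS STRING)", "headers"))
```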
0 votes · 1 answer
Extracting nested JSON values in Spark Streaming Java
How should I parse JSON messages from Kafka in Spark Streaming?
I'm converting a JavaRDD to a Dataset and extracting the values from there. I succeeded in extracting top-level values, but I'm not able to extract nested JSON values such as "host.name" and…

Gokulraj · 450
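The dotted-path lookup the question describes can be illustrated in plain Python, independent of the Spark API (in Spark itself a nested field is usually reached with a column path such as `col("host.name")`, or with `get_json_object`). The sample message here is invented:

```python
import json

# Invented sample message with a nested "host" object.
message = json.loads(
    '{"host": {"name": "web-01", "ip": "10.0.0.5"}, "level": "INFO"}')

def get_path(obj, dotted_path):
    """Walk a dotted path like "host.name" through nested dicts."""
    current = obj
    for part in dotted_path.split("."):
        current = current[part]
    return current

print(get_path(message, "host.name"))   # -> web-01
```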
0 votes · 1 answer
Does Kafka Direct Stream create a consumer group by itself (as it does not care about the group.id property given in the application)?
Let's say I have just launched a Kafka Direct Stream + Spark Streaming application. For the first batch, the StreamingContext in the driver program connects to Kafka and fetches startOffset and endOffset. Then it launches a Spark job with…

Abhilash Reddy · 17
0 votes · 1 answer
Writing data from Kafka to Hive using PySpark - stuck
I'm quite new to Spark and started with PySpark; I'm learning to push data from Kafka to Hive using PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import *
from…

Jay · 1
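One commonly suggested shape for Kafka-to-Hive in PySpark is to land each micro-batch through `foreachBatch`, since each batch is a normal DataFrame with the full batch writer available. A sketch under stated assumptions: Hive support is enabled on the session, the table `db.events` and topic `events` are placeholders:

```python
# Sketch: land each micro-batch into a Hive table via foreachBatch.
# Table name, topic name, and checkpoint path are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-to-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

source = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING) AS key",
                      "CAST(value AS STRING) AS value",
                      "timestamp"))

def write_batch(batch_df, batch_id):
    # Each micro-batch is a plain DataFrame, so the batch writer
    # can append straight into the Hive table.
    batch_df.write.mode("append").saveAsTable("db.events")

query = (source.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/cp/kafka_to_hive")
         .start())
```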