Questions tagged [spark-streaming-kafka]

Spark Streaming integration for Kafka. Direct Stream approach provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
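The Direct Stream approach described above can be sketched as follows (a minimal, hedged example using the legacy DStream API from the spark-streaming-kafka-0-8 package; the broker address and topic name "events" are assumptions, and a running Kafka broker is required):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka-0-8

sc = SparkContext("local[2]", "DirectStreamSketch")
ssc = StreamingContext(sc, 5)  # 5-second batch interval

# createDirectStream yields one Spark partition per Kafka partition
# and exposes offsets through each RDD's offsetRanges().
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"})
stream.pprint()

ssc.start()
ssc.awaitTermination()
```

Note this API was removed in Spark 3.x; Structured Streaming's `readStream.format("kafka")` is the current equivalent.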

250 questions
0
votes
1 answer

Spark Structured Streaming: Output result at the end of Tumbling Window and not the Batch

I want the output of Spark Stream to be sent to the Sink at the end of the Tumbling Window and not at the batch interval. I am reading from a Kafka stream and outputting to another Kafka stream. Code to query and write output is like…
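A common pattern for emitting results only when a tumbling window closes (a hedged sketch, not the asker's code; broker, topic names, and the checkpoint path are assumptions) is append output mode with a watermark, since in append mode a windowed aggregate is written only after the watermark passes the window end:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("TumblingWindowSketch").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input-topic")
          .load()
          # use the Kafka record timestamp as the event-time column
          .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

# Append mode emits each window once, after the watermark passes its end,
# rather than updating the sink on every micro-batch.
counts = (events
          .withWatermark("timestamp", "0 seconds")
          .groupBy(window(col("timestamp"), "10 minutes"))
          .count())

query = (counts.selectExpr("CAST(window.end AS STRING) AS key",
                           "CAST(count AS STRING) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output-topic")
         .option("checkpointLocation", "/tmp/window-chk")
         .outputMode("append")
         .start())
```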
0
votes
2 answers

Spark Structured Streaming to read nested Kafka Connect jsonConverter message

I have ingested an XML file using the Kafka Connect file-pulse connector 1.5.3. Now I want to read it with Spark Streaming to parse/flatten it, as it is quite nested. The string I read out of Kafka (I used the consumer console to read this out, and put an…
0
votes
2 answers

Docker PySpark cluster container not receiving Kafka streaming from the host?

I have created and deployed a Spark cluster which consists of 4 containers running spark-master, spark-worker, spark-submit, and data-mount-container (to access the script from the local directory). I added the required dependency JARs in all these…
0
votes
0 answers

Py4JJavaError: Job aborted due to stage failure: Task 0 in stage 460.0 failed 4 times

I am getting this weird error in my Spark Streaming code written in PySpark. I tried to debug this code but couldn't find any reason. Below is my code; the name of the file is Script.py: import os from pyspark.sql.types import * import json from pyspark…
0
votes
1 answer

What is the best way to structure a spark structured streaming pipeline?

I'm moving data from my Postgres database to Kafka, and in the middle doing some transformations with Spark. I have 50 tables, and for each table the transformations are totally different from the others. So, I want to know what is the best way to…
0
votes
0 answers

Kafka with Spark Streaming works in local but it doesn't work in Standalone mode

I'm trying to use Spark Streaming with a very simple script like this: from pyspark import SparkContext, SparkConf from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils sc =…
0
votes
1 answer

Too many KDC calls from KafkaConsumer on Spark streaming

I have a standalone (master=local, for its own reasons) Spark Structured Streaming application that reads from a kerberized Kafka cluster. It works functionally, but it makes too many calls to the KDC to fetch a TGS for each micro-batch execution. Either…
0
votes
2 answers

Unable to consume data from a Kafka topic in Spark

I'm new to Spark & Kafka. In Spark, I'm facing issues while trying to consume data from a Kafka topic, and I'm getting the following error. Can somebody help me please? In the SBT project I added all the dependencies. build.sbt file: name :=…
0
votes
1 answer

RecordTooLargeException in Spark *Structured* Streaming

I keep getting this error message: The message is 1169350 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration. As indicated in other StackOverflow posts, I am trying to set…
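For context, producer-side settings such as max.request.size must be passed to the Kafka sink with a "kafka." prefix to reach the underlying client; a hedged configuration sketch (broker, topic, the 10 MB limit, and the source DataFrame are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MaxRequestSizeSketch").getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .load())

query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "big-messages")
         # producer configs need the "kafka." prefix to be forwarded
         .option("kafka.max.request.size", "10485760")
         .option("checkpointLocation", "/tmp/big-msg-chk")
         .start())
```

Note that the broker-side message.max.bytes may also need raising; otherwise the broker rejects what the producer now allows.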
0
votes
1 answer

Spark Structured Streaming reads data twice per micro-batch. How to avoid it?

I have a very strange issue with Spark Structured Streaming: it creates two Spark jobs for every micro-batch and, as a result, reads data from Kafka twice. Here is a simple code snippet: import org.apache.hadoop.fs.{FileSystem,…
0
votes
1 answer

I have a problem when working with readStream().format("kafka")

Please help fix this error: 20/04/09 18:38:44 ERROR MicroBatchExecution: Query [id = 9f3cbbf6-85a8-4aed-89c6-f5d3ff9c40fa, runId = 73c071c6-e222-4760-a750-393666a298af] terminated with error java.lang.ClassCastException:…
0
votes
1 answer

Fetch kafka headers in spark 2.4.X

How do I get Kafka header fields (which were introduced in Kafka 0.11+) in Spark Structured Streaming? I see the headers implementation was added in Spark 3.0 but not in 2.4.5, and by default spark-sql-kafka-0-10 uses kafka-clients 2.0. If it…
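In Spark 3.0+, headers are exposed through the includeHeaders read option (a sketch; the broker and topic are assumptions, and this option does not exist on 2.4.x):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HeadersSketch").getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("includeHeaders", "true")  # Spark 3.0+: adds a 'headers' column
      .load())

# 'headers' is an array of (key: string, value: binary) structs
query = (df.selectExpr("key", "value", "headers")
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/headers-chk")
         .start())
```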
0
votes
1 answer

Extracting nested JSON values in Spark Streaming Java

How should I parse JSON messages from Kafka in Spark Streaming? I'm converting a JavaRDD to a Dataset and extracting the values from there. I found success in extracting top-level values; however, I'm not able to extract nested JSON values such as "host.name" and…
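As a plain-Python illustration of the nested extraction the question attempts (outside Spark, for clarity; the message shape is a made-up example), a dotted path like "host.name" can be resolved by walking the parsed object — the Spark equivalent would use from_json with a nested schema or a "host.name" column reference:

```python
import json

def extract(payload: str, dotted_path: str):
    """Walk a parsed JSON object following a dotted path like 'host.name'."""
    node = json.loads(payload)
    for key in dotted_path.split("."):
        node = node[key]  # descend one level per path segment
    return node

msg = '{"host": {"name": "web-1", "ip": "10.0.0.5"}, "level": "INFO"}'
print(extract(msg, "host.name"))  # → web-1
```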
0
votes
1 answer

Does Kafka Direct Stream create a consumer group by itself (as it does not care about the group.id property given in the application)?

Let us say I have just launched a Kafka direct stream + spark streaming application. For the first batch, the Streaming Context in the driver program connects to the Kafka and fetches startOffset and endOffset. Then, it launches a spark job with…
0
votes
1 answer

Writing data from Kafka to Hive using PySpark - stuck

I'm quite new to Spark and started with PySpark. I am learning to push data from Kafka to Hive using PySpark: from pyspark.sql import SparkSession from pyspark.sql.functions import explode from pyspark.sql.functions import * from…
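A hedged sketch of the Kafka-to-Hive path this question describes (broker, topic, and table names are assumptions, and Hive support must be enabled in the session): since there is no built-in streaming Hive sink, a common workaround is foreachBatch, which hands each micro-batch to the ordinary batch writer:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("KafkaToHiveSketch")
         .enableHiveSupport().getOrCreate())

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load()
       .selectExpr("CAST(value AS STRING) AS value"))

# foreachBatch (Spark 2.4+) writes each micro-batch with the batch API,
# so saveAsTable can target a Hive-managed table.
query = (raw.writeStream
         .foreachBatch(lambda batch_df, _batch_id:
                       batch_df.write.mode("append").saveAsTable("default.events"))
         .option("checkpointLocation", "/tmp/kafka-to-hive-chk")
         .start())
```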