Questions tagged [spark-streaming-kafka]

Spark Streaming integration for Kafka. The Direct Stream approach provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.

250 questions
1 vote · 1 answer

How to get Total count of Records from Kafka Topic and Save into HDFS?

All, I am working on consuming data from Kafka and dumping it into HDFS. I am able to consume the data, and I want to get the total count of records from Kafka and save it as a file in HDFS so that I can use that file for validation. I was able to print…
Rab · 159 · 1 · 11
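One common way to do this (a sketch, not the asker's own code) is a batch read of the topic with the Kafka source, a `count()`, and a one-line file written to HDFS. The broker address, topic name, and HDFS path below are placeholders; the pure helper is kept separate so it works without a Spark installation.

```python
def format_count_line(topic, count):
    """Pure helper: the one-line validation record written to HDFS."""
    return "%s,%d" % (topic, count)

def main():
    # Spark import kept inside main() so the helper above stays importable
    # on machines without pyspark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-count-validation").getOrCreate()

    # Batch read of the whole topic (earliest..latest) -- placeholder config.
    df = (spark.read
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "my_topic")
          .option("startingOffsets", "earliest")
          .option("endingOffsets", "latest")
          .load())

    total = df.count()

    # Write a single-line validation file to HDFS.
    line = format_count_line("my_topic", total)
    spark.sparkContext.parallelize([line], 1).saveAsTextFile(
        "hdfs:///validation/my_topic_count")

if __name__ == "__main__":
    pass  # call main() on a cluster with Kafka and HDFS available
```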
1 vote · 0 answers

Formatting structured Kafka stream in pyspark using named regex

I'm trying to extract multiple column values from an existing column in a streaming pyspark dataframe. I read the stream using stream_dataframe = spark_session.readStream \ .format("kafka") \ .option("kafka.bootstrap.servers",…
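Spark's `regexp_extract` selects groups by index rather than by name, so one workaround for named groups is a small UDF around Python's `re` module. A minimal sketch, with an illustrative log pattern and column names (not taken from the question):

```python
import re

# Named-group pattern; the field names are illustrative placeholders.
LOG_PATTERN = re.compile(r"(?P<level>[A-Z]+)\s+(?P<code>\d+)\s+(?P<msg>.*)")

def parse_named(value):
    """Pure parser: returns (level, code, msg), or Nones when no match."""
    m = LOG_PATTERN.match(value or "")
    if not m:
        return (None, None, None)
    return (m.group("level"), m.group("code"), m.group("msg"))

def add_parsed_columns(df):
    """Attach the extracted columns to a (streaming) dataframe via a UDF."""
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType(
        [StructField(n, StringType()) for n in ("level", "code", "msg")])
    parse_udf = udf(parse_named, schema)
    return (df.withColumn("parsed", parse_udf(col("value").cast("string")))
              .select("*", "parsed.*"))
```

A UDF is slower than the built-in `regexp_extract`; when the group positions are stable, calling `regexp_extract(col, pattern, idx)` once per column avoids the Python round-trip.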
1 vote · 1 answer

Do skipped stages have any performance impact on a Spark job?

I am running a Spark Structured Streaming job which involves creating an empty dataframe and updating it with each micro-batch, as below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I am persisting the…
1 vote · 1 answer

KafkaUtils class in kafka 0.11

We are using Spark Streaming to read from and write to Kafka, using the KafkaUtils library in spark-streaming_2.11, which ships with the Kafka 0.10.0 libs. Right now I am in the process of upgrading the kafka-client jars to 0.11 to use some features, but since…
1 vote · 1 answer

kafka direct stream and seekToEnd

In my Spark job I initialize a Kafka stream with KafkaUtils.createDirectStream. I have read about the seekToEnd method on Consumer. How can I apply it to the stream?
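With createDirectStream, Spark manages the consumer internally, so seekToEnd cannot be called on the stream itself; the usual equivalent is to control the starting position through the Kafka params (no committed offsets plus "reset to latest" behaves like seeking to the end). A sketch against the legacy 0.8 Python integration (`pyspark.streaming.kafka`, removed in newer Spark versions), with placeholder broker and topic names:

```python
def latest_only_params(brokers):
    """Kafka params that emulate seekToEnd for a direct stream: with no
    stored offsets, start from the end of each partition. The 0.8
    integration spells this 'largest'; the 0.10 integration uses 'latest'."""
    return {
        "metadata.broker.list": brokers,
        "auto.offset.reset": "largest",
    }

def main():
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="direct-stream-latest")
    ssc = StreamingContext(sc, 5)

    # seekToEnd cannot be invoked on the stream: Spark owns the consumer.
    # The starting position is controlled through the params instead.
    stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], latest_only_params("broker:9092"))
    stream.pprint()

    ssc.start()
    ssc.awaitTermination()
```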
1 vote · 1 answer

Fault tolerance for Kafka Direct Stream does not work: checkpoint directory does not exist

I wrote an app to read data from a Kafka topic, but I can't achieve fault tolerance in the event of a driver failure. The application runs in a k8s cluster using spark-submit. When I run my application for the first time, everything goes well, but when…
1 vote · 0 answers

Could not initialize class with mapGroupsWithState

I'm trying to create a Spark Structured Streaming application with arbitrary state; when I add groupByKey and mapGroupsWithState, it gives me an error after starting the first task. .groupByKey(_.user_id) …
1 vote · 0 answers

How to use JSON data instead of a JSON path in Spark Structured Streaming

I have data in a variable jsondata as below: [{'sno': 1, 'number': '000-00-00000'}] How can I use this data with json() during Structured Streaming in Spark, which actually expects a path that I don't have? I have tried the code below, but it threw an…
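`spark.read.json` also accepts an RDD of JSON strings, so data held in a variable can be loaded without any file path. A minimal sketch (the pure helper is kept separate so it runs without Spark; the dataframe built this way is a static one, which can then be joined against or compared with a stream):

```python
import json

def to_json_lines(records):
    """Pure helper: one JSON document per string, as spark.read.json expects."""
    return [json.dumps(r) for r in records]

def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-from-variable").getOrCreate()

    jsondata = [{"sno": 1, "number": "000-00-00000"}]

    # read.json accepts an RDD of JSON strings, so no path is needed.
    rdd = spark.sparkContext.parallelize(to_json_lines(jsondata))
    df = spark.read.json(rdd)
    df.printSchema()
    df.show()
```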
1 vote · 0 answers

How to add Spark dependencies from inside pyspark shell

I'm working with Spark 2.3.1, Kafka 1.1 and Python 2.7.9, and I'm not able to update them. I've found a problem when I'm trying to use Spark Streaming to push data to (or pull it from) Kafka. When I have no data in the Kafka queue (because it ends or…
Krakenudo · 182 · 1 · 17
1 vote · 2 answers

Kafka createStream running but not printing the processed output from the Kafka topic in Pyspark

I am using Kafka version 2.0, Spark version 2.2.0.2.6.4.0-91, and Python version 2.7.5. I am running the code below and it streams without any error, but the count is not printed in the output. import sys from pyspark import SparkContext from…
1 vote · 0 answers

Commit offsets after processing multiple streams in spark streaming

I have a use case where we create multiple streams from the Kafka DStream. I would like to commit the offsets only after processing both streams successfully. Is this possible? Current strategy: 1) create dstream one. 2) create dstream two. 3)…
wandermonk · 6,856 · 6 · 43 · 93
1 vote · 0 answers

Spark Streaming dynamic executors override kafka parameters in cluster mode

I have written a Spark Streaming consumer to consume data from Kafka. I found a weird behavior in my logs. The Kafka topic has 3 partitions, and for each partition an executor is launched by the Spark Streaming job. The first executor id always…
1 vote · 0 answers

How to calculate the sum of values using PySpark with Kafka and Spark Streaming

Currently I receive 4 or more vehicle IoT sensor data records every second, and for simplicity's sake I would like to start by adding up the 4 velocity readings. Most of the code examples I have found provide counts, which I can…
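A per-batch sum can replace the usual word-count example by mapping each record to its velocity field and reducing with addition. A sketch against the legacy 0.8 Python integration, assuming JSON-encoded records with a `velocity` key (topic and broker names are placeholders):

```python
import json

def velocity_sum(raw_records):
    """Pure helper: sum the 'velocity' field over a batch of JSON strings.
    Records without the field contribute 0.0."""
    return sum(json.loads(r).get("velocity", 0.0) for r in raw_records)

def main():
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="velocity-sum")
    ssc = StreamingContext(sc, 1)  # 1-second batches, matching the arrival rate

    stream = KafkaUtils.createDirectStream(
        ssc, ["vehicle_iot"], {"metadata.broker.list": "broker:9092"})

    # Sum the velocities within each 1-second micro-batch instead of counting.
    (stream.map(lambda kv: json.loads(kv[1]).get("velocity", 0.0))
           .reduce(lambda a, b: a + b)
           .pprint())

    ssc.start()
    ssc.awaitTermination()
```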
1 vote · 0 answers

Correct way to store offsets in Kafka when using Spark and Elastic Search

I have done a lot of research on this, but I am still not able to find something suitable. Everywhere I go, I see that the easiest way is to call saveToEs() and then commit offsets after that. My question is: what if saveToEs() fails for some reason?…
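The usual answer is to make the Elasticsearch write idempotent, so that a failure anywhere between saveToEs() and the offset commit is safe to replay: derive each document id deterministically from the record's topic/partition/offset, and a replayed batch overwrites the same documents instead of duplicating them. A sketch of the pattern; the saveToEs and commitAsync calls are commented placeholders for the elasticsearch-hadoop and Scala 0.10 direct-stream APIs, and only the ordering and the deterministic ids are the point here:

```python
def es_doc_id(topic, partition, offset):
    """Deterministic Elasticsearch document id built from a record's Kafka
    coordinates. If the ES write succeeds but the offset commit then fails,
    the replayed batch upserts the same ids rather than duplicating data."""
    return "%s-%d-%d" % (topic, partition, offset)

def save_then_commit(rdd):
    """Per-batch pattern: write to ES first, commit offsets only afterwards."""
    offset_ranges = rdd.offsetRanges()  # capture offsets before any transform
    # rdd.saveToEs("index/type", {"es.mapping.id": "doc_id"})  # may raise
    # stream.commitAsync(offset_ranges)  # reached only if the write succeeded
    return offset_ranges
```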
1 vote · 0 answers

Spark job fails once a day with java.io.OptionalDataException

I am using Spark 2.2.0 and running my jobs with YARN on Cloudera. It's a streaming job which takes events from Kafka, filters and enriches them, stores them in ES, and then commits offsets back to Kafka. These are the…