Spark Streaming integration for Kafka. The Direct Stream approach provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
Questions tagged [spark-streaming-kafka]
250 questions
1 vote · 1 answer
How to get the total count of records from a Kafka topic and save it into HDFS?
All,
I am working on consuming data from Kafka and dumping it into HDFS. I am able to consume the data and want to get the total count of records from Kafka and save it as a file into HDFS, so that I can use that file for validation. I was able to print…
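A minimal sketch of one way to do this, assuming Spark 2.x: read the topic as a batch (so the count is a single well-defined number) and write the result to HDFS. The broker address, topic name, and output path are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-count-to-hdfs").getOrCreate()
import spark.implicits._

// Batch-read the whole topic (earliest to latest) instead of streaming it,
// so count() returns one well-defined total.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "my_topic")                  // placeholder
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()

val total = df.count()

// Write the count as a one-line text file for the validation step.
Seq(total.toString).toDS.coalesce(1)
  .write.mode("overwrite").text("hdfs:///validation/kafka_count")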

Rab · 159 · 1 · 11
1 vote · 0 answers
Formatting structured Kafka stream in pyspark using named regex
I'm trying to extract multiple column values from an existing column in a streaming pyspark dataframe.
I read the stream using
stream_dataframe = spark_session.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers",…

Mister Spurious · 271 · 2 · 11
1 vote · 1 answer
Do skipped stages have any performance impact on a Spark job?
I am running a Spark Structured Streaming job which involves creating an empty dataframe and updating it with each micro-batch, as below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I am persisting the…
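For reference, a common shape of this pattern, sketched on the assumption that the accumulation happens in foreachBatch (Spark 2.4+): persisting each new union and unpersisting the previous one keeps the growing lineage from being replayed every micro-batch. The schema, topic, and brokers are placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("accumulate").getOrCreate()
import spark.implicits._

// Accumulator dataframe updated by every micro-batch (schema is illustrative).
var accumulated: DataFrame = Seq.empty[(String, Long)].toDF("key", "value")

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "my_topic")                  // placeholder
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "offset AS value")

val query = stream.writeStream.foreachBatch { (batch: DataFrame, _: Long) =>
  val previous = accumulated
  accumulated = previous.union(batch).persist() // cache the new state
  accumulated.count()                           // materialize the cache
  previous.unpersist()                          // drop the stale copy
}.start()

query.awaitTermination()

Skipped stages themselves are benign: they mean a shuffle output was reused rather than recomputed. What grows without the persist-and-swap is the plan and lineage each batch has to carry.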

conetfun · 1,605 · 4 · 17 · 38
1 vote · 1 answer
KafkaUtils class in kafka 0.11
We are using Spark Streaming to read from and write to Kafka, and use the KafkaUtils library in spark-streaming_2.11, which has the Kafka 0.10.0 libs. Right now I am in the process of upgrading the kafka-client jars to 0.11 to use some feature, but since…
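The mechanics of the swap can be sketched in sbt: keep the Spark integration module and pin the newer client on the classpath. Whether the 0.10 integration actually runs against 0.11.x clients has to be verified; the versions below are assumptions for illustration.

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.2.0" % Provided,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
  "org.apache.kafka" % "kafka-clients" % "0.11.0.2"
)

// Pin the transitive kafka-clients 0.10.0 dependency to the newer version.
dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "0.11.0.2"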

Ajith Kannan · 812 · 1 · 8 · 30
1 vote · 1 answer
kafka direct stream and seekToEnd
In my Spark job I initialize the Kafka stream with KafkaUtils.createDirectStream.
I have read about the seekToEnd method on the Consumer. How can I apply it to the stream?
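There is no seekToEnd hook exposed on the DStream itself; the closest equivalent is letting the consumer start from the log end by setting auto.offset.reset to latest for a group with no committed offsets (explicit per-partition offsets can also be passed to ConsumerStrategies.Subscribe). A sketch, with brokers, topic, and group id as placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(new SparkConf().setAppName("from-log-end"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",              // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",                          // placeholder
  // With no committed offsets for the group, the stream begins at the log
  // end, which is effectively seekToEnd for the direct stream.
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my_topic"), kafkaParams))

stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))
ssc.start()
ssc.awaitTermination()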

Dorg · 11 · 2
1 vote · 1 answer
Fault tolerance for Kafka Direct Stream does not work: checkpoint directory does not exist
I wrote an app that reads data from a Kafka topic, and I can't achieve fault tolerance in the event of a driver failure. The application runs in a k8s cluster using spark-submit. When I run my application for the first time, everything goes well, but when…
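The usual pattern for surviving driver failure is StreamingContext.getOrCreate with a checkpoint directory on storage that outlives the driver pod (HDFS, S3, or a persistent volume); a path on the pod's local disk vanishes with the pod, which is one way to end up with "checkpoint directory does not exist". A sketch, with the checkpoint path as a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Must live on storage that survives a driver-pod restart.
val checkpointDir = "hdfs:///checkpoints/my-streaming-app" // placeholder

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("fault-tolerant"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Build the Kafka direct stream and all transformations here, inside the
  // factory, so they can be reconstructed from the checkpoint on restart.
  ssc
}

// First run: calls createContext(). After a driver failure: rebuilds the
// context, DAG, and offsets from the checkpoint directory.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()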

Дмитрий Киселёв · 11 · 5
1 vote · 0 answers
Could not initialize class with mapGroupsWithState
I'm trying to create a Spark Structured Streaming application with arbitrary state. When I add groupByKey and mapGroupsWithState, it gives me an error after starting the first task.
.groupByKey(_.user_id)
…
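"Could not initialize class" usually points at a static-initializer or executor-classpath problem rather than the API itself, but for reference, here is the general shape of groupByKey plus mapGroupsWithState in Scala. The case classes, parsing, and update logic are illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Illustrative event and state types.
case class Event(user_id: String, amount: Double)
case class UserState(user_id: String, total: Double)

val spark = SparkSession.builder().appName("arbitrary-state").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "my_topic")                  // placeholder
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]
  .map(line => Event(line, 1.0)) // placeholder parsing

// Fold each user's events into a running total held in GroupState.
def updateState(userId: String, batch: Iterator[Event],
                state: GroupState[UserState]): UserState = {
  val previous = state.getOption.getOrElse(UserState(userId, 0.0))
  val updated = previous.copy(total = previous.total + batch.map(_.amount).sum)
  state.update(updated)
  updated
}

val stateful = events
  .groupByKey(_.user_id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateState)

stateful.writeStream
  .outputMode("update") // mapGroupsWithState pairs with update mode here
  .format("console")
  .start()
  .awaitTermination()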

M. Alexandru · 614 · 5 · 20
1 vote · 0 answers
How to use JSON data instead of a JSON path in Spark Structured Streaming
I have data in a variable jsondata as below:
[{'sno': 1, 'number': '000-00-00000'}]
How can I use this data with json() during Structured Streaming in Spark, which actually expects a path that I don't have?
I have tried the code below, but it threw an…
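When the JSON lives in a variable there is nothing to point a path at; one way around it is to wrap the string in a Dataset and use spark.read.json, and the streaming analogue is from_json on a string column with an explicit schema. A sketch following the question's sample data (the single quotes become double quotes to make it valid JSON):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructType}

val spark = SparkSession.builder().appName("json-variable").getOrCreate()
import spark.implicits._

val jsondata = """[{"sno": 1, "number": "000-00-00000"}]"""

// Batch: read JSON from an in-memory Dataset[String] instead of a path.
spark.read.json(Seq(jsondata).toDS).show()

// Streaming analogue: parse a string column with from_json; streams cannot
// infer a schema from data, so it must be given explicitly.
val schema = ArrayType(new StructType()
  .add("sno", LongType)
  .add("number", StringType))
Seq(jsondata).toDF("value")
  .select(from_json($"value", schema).as("rows"))
  .show(false)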

Srinathji Kyadari · 221 · 5 · 15
1 vote · 0 answers
How to add Spark dependencies from inside pyspark shell
I'm working with Spark 2.3.1, Kafka 1.1 and Python 2.7.9, and I'm not able to upgrade them.
I've run into a problem when trying to use Spark Streaming to push data to (or pull data from) Kafka. When I have no data in the Kafka queue (because it ends or…
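Dependencies generally cannot be attached once the shell's JVM is up; the workaround is to relaunch pyspark with the integration declared up front via --packages. The exact coordinates below are an assumption (the 0.8 DStream integration that PySpark still supported at 2.3.1):

# Relaunch the shell with the Kafka integration on the driver and executors.
pyspark --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1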

Krakenudo · 182 · 1 · 17
1 vote · 2 answers
Kafka createStream running but not printing the processed output from the Kafka topic in PySpark
I am using Kafka 2.0, Spark 2.2.0.2.6.4.0-91, and Python 2.7.5.
I am running the code below and it streams without any error, but the count is not printed in the output.
import sys
from pyspark import SparkContext
from…
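Two things have to be true for a count to appear with this API: an output operation (count().print()) registered before the context starts, and ssc.start() actually called. A minimal Scala skeleton illustrating the ordering, using queueStream as a stand-in source so the example is self-contained:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val ssc = new StreamingContext(new SparkConf().setAppName("print-check"), Seconds(1))

// Stand-in for the Kafka stream, just to make the skeleton runnable.
val queue = mutable.Queue(ssc.sparkContext.makeRDD(Seq("a", "b", "c")))
val stream = ssc.queueStream(queue)

stream.count().print() // must be registered before start()

ssc.start()            // without this, no batch runs and nothing prints
ssc.awaitTermination()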

Afzal Abdul Azeez · 195 · 1 · 13
1 vote · 0 answers
Commit offsets after processing multiple streams in spark streaming
I have a usecase where we create multiple streams from the Kafka DStream. I would like to commit the offsets only after processing both the streams successfully. Is this possible?
Current strategy:
1) create dstream one.
2) create dstream two.
3)…
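Offsets can only be committed through the original input stream (the object implementing CanCommitOffsets), so one approach is to capture the batch's offset ranges first, run both pipelines, and commit once at the end. A sketch, where the two process functions are hypothetical stand-ins:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(new SparkConf().setAppName("two-pipelines"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092", // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "two-pipelines-group",  // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean))

val inputStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my_topic"), kafkaParams))

// Hypothetical stand-ins for the two derived pipelines.
def processStreamOne(rdd: RDD[ConsumerRecord[String, String]]): Unit = rdd.foreach(_ => ())
def processStreamTwo(rdd: RDD[ConsumerRecord[String, String]]): Unit = rdd.foreach(_ => ())

inputStream.foreachRDD { rdd =>
  // Offset ranges are only available on the original, untransformed Kafka RDD.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  processStreamOne(rdd)
  processStreamTwo(rdd)

  // Commit once, only after both pipelines succeeded; an exception above
  // skips the commit, so the batch replays on restart (keep processing idempotent).
  inputStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()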

wandermonk · 6,856 · 6 · 43 · 93
1 vote · 0 answers
Spark Streaming dynamic executors override Kafka parameters in cluster mode
I have written a Spark Streaming consumer to consume the data from Kafka, and I found some weird behavior in my logs. The Kafka topic has 3 partitions and, for each partition, an executor is launched by the Spark Streaming job.
The first executor id always…

wandermonk · 6,856 · 6 · 43 · 93
1 vote · 0 answers
How to calculate the sum of values using PySpark with Kafka and Spark Streaming
Currently I receive 4 or more vehicle IoT sensor data records every second, and for simplicity's sake I would like to start by adding the 4 velocity readings. Most of the code examples I have found provide counts, which I can…
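Replacing the usual count with a sum is mostly a matter of casting the payload to a number and aggregating. A structured-streaming sketch in Scala, assuming each message value is a bare numeric velocity (real payloads would need from_json first); topic and brokers are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("velocity-sum").getOrCreate()
import spark.implicits._

val velocities = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "vehicle_telemetry")         // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS v")
  .select($"v".cast("double").as("velocity"))

// Running total of all velocities seen so far; "complete" mode reprints
// the whole (single-row) result every micro-batch.
velocities.agg(sum($"velocity").as("velocity_sum"))
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()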

Raanan Ikar · 57 · 7
1 vote · 0 answers
Correct way to store offsets in Kafka when using Spark and Elasticsearch
I have done a lot of research on this, but I am still not able to find something suitable. Everywhere I go, I see that the easiest way is to call saveToEs() and then commit offsets after that. My question is: what if saveToEs() fails for some reason?…
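Committing after the write only gives at-least-once, so the usual companion is making the Elasticsearch write idempotent: key every document with a stable id (es.mapping.id) so a replayed batch overwrites rather than duplicates. A sketch against elasticsearch-hadoop's EsSpark, assuming a direct stream set up as in the earlier sketches; the index name and id scheme are illustrative:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
import org.elasticsearch.spark.rdd.EsSpark

// `stream` comes from KafkaUtils.createDirectStream, as shown above.
def writeThenCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // topic-partition-offset is unique per record, so retries overwrite
    // the same documents instead of duplicating them.
    val docs = rdd.map(r => Map(
      "id"      -> s"${r.topic}-${r.partition}-${r.offset}",
      "payload" -> r.value))
    EsSpark.saveToEs(docs, "events/doc", Map("es.mapping.id" -> "id"))

    // Reached only if saveToEs did not throw; on failure the offsets stay
    // uncommitted and the batch is re-delivered after restart.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }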

alina · 291 · 2 · 9
1 vote · 0 answers
Spark job fails once a day with java.io.OptionalDataException
I am using Spark 2.2.0 and running my jobs using YARN on Cloudera. It's a streaming job which takes events from Kafka, filters and enriches them, stores them in ES, and then commits offsets back to Kafka. These are the…

alina · 291 · 2 · 9