Spark Streaming integration for Kafka. The Direct Stream approach provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata; a minimal consumer sketch follows.
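For example, a minimal direct-stream consumer using the spark-streaming-kafka-0-10 API. Broker address, group id and topic name are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object DirectStreamExample {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("direct-stream-example"), Seconds(5))

        // Placeholder consumer settings.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "localhost:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "example-group",
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean))

        // One Spark partition per Kafka partition.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

        // Offsets and metadata are available on each batch's RDD.
        stream.foreachRDD { rdd =>
          val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.fromOffset}..${r.untilOffset}"))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }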
Questions tagged [spark-streaming-kafka]
250 questions
0
votes
1 answer
How can duplicate records appear in the Kafka queue?
I'm using Apache NiFi, Spark and Kafka to send messages between them. First, I ingest data with NiFi and send it to Spark to process it. Then I send the data from Spark back to NiFi to insert it into a DB.
My problem is that each time I run Spark,… (one common fix, committing offsets only after output, is sketched after this entry)

Krakenudo
- 182
- 1
- 17
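One common cause of replayed records is restarting a job whose offsets were never committed: with enable.auto.commit disabled, every restart reprocesses from the last committed position. A sketch, assuming a spark-streaming-kafka-0-10 direct stream like the one under the tag description above:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    // `stream` is the input DStream returned by KafkaUtils.createDirectStream.
    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... deliver the batch downstream (e.g. back to NiFi / the DB) here ...
      // Commit only after the output succeeds, so a restart does not
      // re-deliver records that already reached the database.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
    }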
0
votes
0 answers
Kafka stream is not listening on input when created through the SparkSession builder
I am trying to create a Kafka consumer that uses the MongoDB-Spark-Connector in the same program: something like Kafka input as an RDD, converted to a DataFrame, then stored in MongoDB for later use.
My producer is up and running and the "standard"…

Ranger
- 75
- 7
0
votes
2 answers
What are the drawbacks of Spark-Kafka integration on a local machine for real-time Twitter streaming analysis?
I am using Spark-Kafka integration in my project, which finds the top trending hashtags on Twitter. For this, I am using Kafka to push tweets through Tweepy streaming, and on the consumer side I am using Spark Streaming for DStream…

Dharmesh Singh
- 107
- 2
- 11
0
votes
1 answer
Spark Structured Streaming with Kafka source, change number of topic partitions while query is running
I've set up a Spark structured streaming query that reads from a Kafka topic.
If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice, and data on the new partitions is not consumed (a minimal source setup is sketched after this entry).
Is there a…

redsk
- 261
- 6
- 11
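For reference, a minimal Kafka-source query of the kind described. Broker, topic and checkpoint path are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-structured").getOrCreate()

    // Structured streaming source over a Kafka topic.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")
      .load()

    val query = events
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-structured")
      .start()

    query.awaitTermination()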
0
votes
1 answer
Spark - Kudu predicate pushdown
I'm using Kudu and Spark Streaming for a real-time dashboard. My problem is that when I join the batch from Spark Streaming with the Kudu table, no predicate pushdown happens, so it takes 2-3 seconds to fetch the entire table into Spark and… (a filter-before-join workaround is sketched after this entry)

M. Alexandru
- 614
- 5
- 20
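A common workaround, sketched under assumed names (master address, table and join column are placeholders): Spark pushes literal filters, but not join keys, into a data-source scan, so collecting the small streaming batch's keys and filtering the Kudu DataFrame explicitly lets kudu-spark prune the scan before the join.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical names: "id" join column, Kudu master and table.
    def joinWithKudu(spark: SparkSession, batch: DataFrame): DataFrame = {
      val kuduTable = spark.read
        .format("org.apache.kudu.spark.kudu")
        .option("kudu.master", "kudu-master:7051")
        .option("kudu.table", "impala::db.metrics")
        .load()

      // Collect the (small) batch keys and push them as a literal IN filter,
      // which the Kudu data source can translate into a pruned scan.
      val keys = batch.select("id").distinct().collect().map(_.getString(0))
      kuduTable.where(kuduTable("id").isin(keys: _*)).join(batch, "id")
    }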
0
votes
1 answer
Opinion: Querying databases from Spark streaming or Structured streaming tasks
We have a Spark streaming use case where we need to compute some metrics from ingested events (in Kafka), but the computations require additional metadata that is not present in the events (a connection-per-partition sketch follows this entry).
The obvious design pattern I can think of is to make…

AbhinavChoudhury
- 1,167
- 1
- 18
- 38
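One sketch of that pattern, with a hypothetical MetadataClient standing in for the real database client: open one connection per partition inside mapPartitions rather than one per record.

    import org.apache.spark.streaming.dstream.DStream

    final case class Event(id: String, key: String)

    // Hypothetical stand-in for a JDBC/REST metadata client.
    class MetadataClient {
      private val table = Map("k1" -> "meta-1", "k2" -> "meta-2")
      def lookup(key: String): Option[String] = table.get(key)
      def close(): Unit = ()
    }

    def enrich(events: DStream[Event]): DStream[(Event, Option[String])] =
      events.mapPartitions { iter =>
        val client = new MetadataClient()          // one client per partition
        // Materialize before closing: mapPartitions hands us a lazy iterator.
        val out = iter.map(e => (e, client.lookup(e.key))).toVector
        client.close()
        out.iterator
      }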
0
votes
0 answers
Spark structured streaming with kafka throwing error after running for a while
I am observing weird behaviour while running a Spark structured streaming program. I am using an S3 bucket for metadata checkpointing.
The Kafka topic has 310 partitions.
When I start the streaming job for the first time, after completion of every batch…

unknown_k
- 11
- 2
0
votes
1 answer
How to be sure all documents are written to Elasticsearch using the Elasticsearch-Hadoop connector in Spark Streaming
I am writing a DStream to Elasticsearch using the Elasticsearch-Hadoop connector, which is documented at
https://www.elastic.co/guide/en/elasticsearch/hadoop/5.6/spark.html
I need to process the window, write all the documents to ES… (a minimal saveToEs sketch follows this entry)

Yılmaz
- 185
- 2
- 14
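For reference, the connector documented at the link above adds saveToEs directly to DStreams; a minimal sketch, with the index name and document shape assumed:

    import org.apache.spark.streaming.dstream.DStream
    import org.elasticsearch.spark.streaming._   // adds saveToEs to DStreams

    // Assumes "es.nodes" is set on the SparkConf,
    // e.g. .set("es.nodes", "localhost:9200").
    def writeToEs(docs: DStream[Map[String, String]]): Unit =
      docs.saveToEs("tweets/doc")   // "index/type" addressing, per the 5.6 docs

If the next step must wait until a batch's documents are written, one option is to call the RDD variant of saveToEs (via org.elasticsearch.spark._) inside foreachRDD, since that submits a Spark job that blocks until the batch completes.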
0
votes
0 answers
Difficulties translating Scala Spark-Streaming code to Pyspark
I am trying to translate to PySpark the Spark implementation discussed in this blog:
https://dorianbg.wordpress.com/2017/11/11/building-the-speed-layer-of-lambda-architecture-using-structured-spark-streaming/
However, I am having a lot of…

Nelson Fleig
- 111
- 1
- 7
0
votes
0 answers
How do we plan an outage of the Kafka cluster when we are using Kafka for Spark streaming?
If for some reason we need to bring the Kafka cluster down while the source continues to produce data, there will be data loss. How do we plan the outage of the Kafka cluster?
Please advise how we can handle this scenario.
I have tried…

jin
- 11
- 4
0
votes
0 answers
Spark kafka streaming failing to determine position of partition
I am creating a Spark streaming application with Kafka. The consumer configuration begins as follows (a completed parameter map is sketched after this entry):
val kafkaParams = Map[String,Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> kafkaConfig.bootstrapServers,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG ->…

bitan
- 444
- 4
- 14
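For reference, a completed parameter map in the same style as the excerpt, with placeholder values standing in for the kafkaConfig fields:

    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.kafka.common.serialization.StringDeserializer

    // Placeholder values; the question reads these from kafkaConfig.
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "example-group",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean))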
0
votes
1 answer
Spark Structured streaming watermark corresponding to deviceid
Incoming data is a stream like the one below, consisting of 3 columns (a watermarked aggregation over this schema is sketched after this entry):
[
system -> deviceId,
time -> eventTime,
value -> some metric
]
+-------+-------------------+-----+
|system |time …

Rahul Shukla
- 505
- 7
- 20
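For reference, a watermarked aggregation over that (system, time, value) schema. Note that Spark tracks one global watermark per query rather than one per device; the thresholds below are placeholders.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{avg, col, window}

    // Watermark on the event-time column, then a windowed aggregate per device.
    def aggregate(events: DataFrame): DataFrame =
      events
        .withWatermark("time", "10 minutes")   // one global watermark per query
        .groupBy(col("system"), window(col("time"), "5 minutes"))
        .agg(avg(col("value")))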
0
votes
1 answer
How to set streaming app checkpointing to Azure storage?
I am trying to set checkpointing for a Spark streaming application to Azure storage. I was using S3 and the code was working fine.
Here is the latest code for how I set checkpointing to Azure (a fuller configuration sketch follows this entry):
sc.hadoopConfiguration
.set("fs.azure",…

MarkZ
- 29
- 9
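For reference, a fuller version of that configuration using the hadoop-azure (wasb) filesystem. Account name, container and key are placeholders, and hadoop-azure plus azure-storage must be on the classpath:

    // Placeholders: MYACCOUNT, mycontainer, ACCOUNT_KEY.
    sc.hadoopConfiguration.set(
      "fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    sc.hadoopConfiguration.set(
      "fs.azure.account.key.MYACCOUNT.blob.core.windows.net", "ACCOUNT_KEY")

    ssc.checkpoint("wasb://mycontainer@MYACCOUNT.blob.core.windows.net/checkpoints")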
0
votes
1 answer
AbstractMethodError upon creation of new StreamingContext
I've been having problems trying to instantiate a new StreamingContext in Spark Streaming: when I create one, an AbstractMethodError is thrown.
I've been debugging the stack trace and found out that…

Luan Araldi
- 31
- 1
- 9
0
votes
1 answer
Unauthorized error setting batch-bigtable as data host from Spark streaming
I'm following the example here for writing to Cloud Bigtable from Spark Streaming: https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/spark-streaming
In my instance, I'm consuming from Kafka, doing some transformations,…

j.r.e.
- 11
- 4