Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Since version 1.3.0, it has supported exactly-once processing semantics, even in the face of failures.
Questions tagged [spark-streaming]
5565 questions
2 votes · 0 answers
Why can't an autowired Spark Function be accessed from inside another Spark Function?
I am using a Spark filter Function f1 (its call() method) inside another Spark Function f2. I am autowiring the function object f1, but this ultimately throws Task Not Serializable. Then, when I assigned this object to another local variable, it…

ronojoy ghosh · 121 · 10
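The workaround the asker stumbled on is the standard fix for Spark's closure capture: referencing a field of the enclosing class (here, the autowired bean) inside a lambda drags the whole object into the serialized task. A minimal sketch, with a hypothetical LogProcessor standing in for the Spring-managed class:

    import org.apache.spark.rdd.RDD

    // Hypothetical enclosing class; note that it is NOT Serializable.
    class LogProcessor(val keepPattern: String) {

      // Broken: `keepPattern` is really `this.keepPattern`, so the closure
      // captures `this` and Spark throws Task Not Serializable.
      def filterBroken(lines: RDD[String]): RDD[String] =
        lines.filter(line => line.contains(keepPattern))

      // Fixed: copying the field into a local val means only the String
      // (which is serializable) is captured by the closure.
      def filterFixed(lines: RDD[String]): RDD[String] = {
        val pattern = keepPattern
        lines.filter(line => line.contains(pattern))
      }
    }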
2 votes · 1 answer
Spark Streaming: source HBase
Is it possible to have a spark-streaming job set up to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under supported sources. But they seem to be using the following static API…

void · 2,403 · 6 · 28 · 53
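Spark Streaming has no built-in HBase source, so the usual route is a custom receiver. A hedged sketch of one way to approximate "new/updated rows every batch", using HBase cell timestamps as a watermark; the class name, poll interval, and row encoding are illustrative rather than an established recipe:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import scala.collection.JavaConverters._

    class HBasePollingReceiver(tableName: String, pollMs: Long)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

      def onStart(): Unit =
        new Thread("hbase-poller") { override def run(): Unit = poll() }.start()

      def onStop(): Unit = () // isStopped() flips to true and the loop exits

      private def poll(): Unit = {
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf(tableName))
        var lastSeen = 0L
        while (!isStopped()) {
          val now = System.currentTimeMillis()
          // only cells written since the previous poll
          val scanner = table.getScanner(new Scan().setTimeRange(lastSeen, now))
          scanner.iterator().asScala.foreach(r => store(Bytes.toString(r.getRow)))
          scanner.close()
          lastSeen = now
          Thread.sleep(pollMs)
        }
        table.close(); conn.close()
      }
    }

Hooked up with ssc.receiverStream(new HBasePollingReceiver("mytable", 5000)), this yields a DStream[String] of row keys per batch.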
2 votes · 3 answers
Spark Streaming: how does the mapWithState function work in a cluster?
I am using Spark Streaming v2.0.0 to retrieve logs from Kafka and to do some manipulation. I am using the function mapWithState in order to save and update some fields related to a device. I am wondering how this function works in a cluster. Indeed, i…

Yassir S · 1,032 · 3 · 21 · 44
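As background for the cluster question: mapWithState hash-partitions the state by key and checkpoints it, so in each micro-batch an executor updates only the state partitions it owns. A self-contained sketch following the shape of the official example; the socket source and per-line keying are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object StatefulCounts {
      // running count per key; state lives in partitioned, checkpointed RDDs
      def updateCount(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
        val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
        state.update(sum)
        (key, sum)
      }

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("state"), Seconds(5))
        ssc.checkpoint("/tmp/state-checkpoint") // mapWithState requires checkpointing
        val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))
        events.mapWithState(StateSpec.function(updateCount _)).print()
        ssc.start(); ssc.awaitTermination()
      }
    }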
2 votes · 0 answers
SPARK - Join Large Dataset with Smaller Dataset on Daily Basis
I have an ever-growing table with ~43 million unique rows as of today. I need to join this table with another, smaller dataset on a daily basis.
My cluster config is:
Nodes: 3
Memory: 162 GB in total (54 GB per node)
Total cores: …

underwood · 845 · 2 · 11 · 22
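If the daily dataset is small enough to fit in executor memory, the usual answer is a broadcast (map-side) join, which avoids shuffling the 43-million-row table entirely. A hedged Spark 2.x sketch; the paths, formats, and join column id are invented for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object DailyJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("daily-join").getOrCreate()
        val large = spark.read.parquet("/data/large_table")  // ~43M unique rows
        val daily = spark.read.parquet("/data/daily_delta")  // small daily dataset
        // broadcast() ships the small side to every executor: no shuffle of `large`
        large.join(broadcast(daily), "id")
          .write.mode("overwrite").parquet("/data/joined")
        spark.stop()
      }
    }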
2 votes · 2 answers
microbatch lookup in external database from spark
I have a requirement where I need to process log lines using Spark. One of the steps in processing is to look up a certain value in an external DB.
For example:
my log line contains multiple key-value pairs. One of the keys present in the log is "key1".…

Alok · 1,374 · 3 · 18 · 44
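The standard pattern for this is mapPartitions, which amortizes one DB connection over a whole partition instead of opening one per record (and can be applied per micro-batch via dstream.transform). A sketch under assumed specifics; the JDBC URL, query, and key extraction are placeholders:

    import java.sql.DriverManager
    import org.apache.spark.rdd.RDD

    def enrich(logLines: RDD[String]): RDD[String] = logLines.mapPartitions { lines =>
      val conn = DriverManager.getConnection("jdbc:postgresql://db:5432/lookup")
      val stmt = conn.prepareStatement("SELECT v FROM kv WHERE k = ?")
      // materialize before closing: mapPartitions hands back a lazy iterator,
      // and the connection must outlive every lookup
      val out = lines.map { line =>
        val key1 = line.split(",")(0) // placeholder extraction of "key1"
        stmt.setString(1, key1)
        val rs = stmt.executeQuery()
        val v  = if (rs.next()) rs.getString(1) else ""
        rs.close()
        s"$line,$v"
      }.toList
      stmt.close(); conn.close()
      out.iterator
    }

If the lookup table is small, broadcasting it once and joining in memory would cut the round-trips to zero.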
2 votes · 2 answers
Spark Streaming + Kinesis : Receiver MaxRate is violated
I am calling spark-submit passing maxRate; I have a single Kinesis receiver and batches of 1s:
spark-submit --conf spark.streaming.receiver.maxRate=10 ....
However, a single batch can greatly exceed the established maxRate, i.e. I'm getting 300…

David Przybilla · 830 · 6 · 16
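One plausible explanation, depending on the Spark and connector versions, is that receivers which store records in bulk blocks can slip past the per-record rate limiter that enforces spark.streaming.receiver.maxRate. The usual mitigation is to enable backpressure on top of the static cap; the values below are illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kinesis-capped")
      .set("spark.streaming.receiver.maxRate", "10")         // records/sec per receiver
      .set("spark.streaming.backpressure.enabled", "true")   // feedback from batch timings
      .set("spark.streaming.backpressure.initialRate", "10") // cap before feedback kicks in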
2 votes · 1 answer
Apache Spark: Reference pointer to the parent RDD
I understand that Spark maintains the lineage information for an RDD. Suppose I have an RDD "a" and, using some transformation on it, I produce a new RDD "b". In such a scenario, "a" is the parent RDD of "b". Is it possible to get back the RDD "a"…

Yassir S · 1,032 · 3 · 21 · 44
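Yes: the lineage is exposed through the public RDD.dependencies field, and each Dependency carries a reference to its parent RDD. A quick local demonstration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[2]"))
    val a = sc.parallelize(1 to 10)
    val b = a.map(_ * 2)

    val parent = b.dependencies.head.rdd // the parent, typed as RDD[_]
    println(parent == a)                 // true: it is the very same object
    println(b.toDebugString)             // prints the whole lineage chain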
2 votes · 1 answer
Spark Streaming: using an object as key in 'mapToPair'
In my Spark Streaming application I receive the following data types:
{
  "timestamp": 1479740400000,
  "key": "power",
  "value": 50
}
I want to group by timestamp and key and aggregate the value field.
Is there any way of keying by an object…

Stephen Young · 852 · 1 · 11 · 21
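In Scala the idiomatic key is a case class, which gets structural equals and hashCode for free; in the Java API (where mapToPair lives) the equivalent is a small key class that implements Serializable and overrides equals() and hashCode(). A Scala sketch with field names taken from the sample record:

    import org.apache.spark.streaming.dstream.DStream

    case class EventKey(timestamp: Long, key: String)
    case class Event(timestamp: Long, key: String, value: Long)

    def aggregate(events: DStream[Event]): DStream[(EventKey, Long)] =
      events
        .map(e => (EventKey(e.timestamp, e.key), e.value))
        .reduceByKey(_ + _) // grouped by (timestamp, key), values summed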
2 votes · 1 answer
Why would Spark Streaming application stall when consuming from Kafka on YARN?
I'm writing a Spark Streaming app in Scala. The goal of the app is to consume the latest records from Kafka and print them to stdout.
The app works perfectly when I run it locally using --master local[n]. However, when I run the app in YARN (and…

dqian96 · 25 · 5
2 votes · 1 answer
How to make the consumer know that the Producer has finished sending all the messages to the Broker?
1: We are working on near-real-time processing or batch processing using Spark Streaming. Our current design includes Kafka.
2: Every 15 minutes the producer will send the messages.
3: We plan to use Spark Streaming to consume messages from…

Dasarathy D R · 335 · 2 · 7 · 20
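Kafka itself carries no "producer finished" signal, so the usual workaround is an application-level sentinel: the producer appends a marker record after each 15-minute batch, and the consumer treats the batch as complete once the marker shows up. A deliberately simplified sketch; it assumes the data and its marker land in the same micro-batch, and a real version would accumulate records until every partition's marker has been seen:

    import org.apache.spark.streaming.dstream.DStream

    val EndMarker = "__BATCH_COMPLETE__" // illustrative marker value

    def processWhenComplete(messages: DStream[String]): Unit =
      messages.foreachRDD { rdd =>
        if (rdd.filter(_ == EndMarker).count() > 0) {
          // the producer's batch is fully in Kafka; do the real work here
          rdd.filter(_ != EndMarker).foreach(println) // println is a placeholder
        }
      }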
2 votes · 1 answer
Spark Streaming gets stopped without errors after ~1 minute
When I run spark-submit for the Spark Streaming job, I can see that it runs for approximately 1 minute and then stops with the final status SUCCEEDED:
16/11/16 18:58:16 INFO yarn.Client: Application report for application_XXXX_XXX…

duckertito · 3,365 · 2 · 18 · 25
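A frequent cause of this exact symptom, a clean SUCCEEDED after roughly a minute with no errors, is a driver that builds the DStream graph and then lets main() return because ssc.awaitTermination() was never called. A minimal skeleton; the socket source is a placeholder:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object KeepRunning {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("stream"), Seconds(5))
        ssc.socketTextStream("localhost", 9999).print()

        ssc.start()
        // without this call, main() returns, the driver exits cleanly,
        // and YARN reports the application as SUCCEEDED
        ssc.awaitTermination()
      }
    }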
2 votes · 1 answer
Spark Streaming - How do I notify the consumer once the producer is done?
Is it possible to notify the consumer once the producer has published all the data to a Kafka topic?
The same data (with some unique field) may be available in multiple partitions, so I need to group the data and do some calculation.
I…

Shankar · 8,529 · 26 · 90 · 159
2 votes · 0 answers
Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch
My Spark Streaming application (Spark 2.0, on an AWS EMR YARN cluster) listens to Campaigns based on live stock feeds, and the batch duration is 5 seconds. The application uses Kafka DirectStream and, based on the feed source, there are three streams.…

Dev Loper · 209 · 1 · 4 · 18
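Two things are worth checking here. First, reduceByKey runs per DStream, so with three separate direct streams the "same" key can survive once per stream within a batch; unioning the streams first makes the reduction span all feeds. Second, if the key is a custom class it needs consistent equals/hashCode. A sketch of the union approach, with an illustrative Feed type:

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    case class Feed(ts: Long, price: Double)
    def latest(a: Feed, b: Feed): Feed = if (a.ts >= b.ts) a else b

    // one reduction across all three Kafka direct streams
    def dedup(ssc: StreamingContext,
              streams: Seq[DStream[(String, Feed)]]): DStream[(String, Feed)] =
      ssc.union(streams).reduceByKey(latest)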
2 votes · 1 answer
Spark Streaming REST Custom Receiver
Is it possible to use a REST API in a custom receiver for Spark Streaming?
I am trying to be able to do multiple calls / reads from that API asynchronously and use Spark Streaming to do it.

Eugen · 1,537 · 7 · 29 · 57
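Yes; a custom receiver can call anything that yields records, including a REST API. A minimal sketch in which several polling threads call the endpoint independently, giving a crude version of the asynchronous multi-call behaviour asked about; the URL, thread count, and poll interval are illustrative, and a production version would need error handling and retries:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import scala.io.Source

    class RestReceiver(url: String, threads: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

      def onStart(): Unit =
        (1 to threads).foreach { i =>
          new Thread(s"rest-poller-$i") {
            override def run(): Unit =
              while (!isStopped()) {
                store(Source.fromURL(url).mkString) // one record per response body
                Thread.sleep(1000)
              }
          }.start()
        }

      def onStop(): Unit = () // the threads observe isStopped() and exit
    }

    // usage: ssc.receiverStream(new RestReceiver("http://api.example.com/data", 4))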
2 votes · 0 answers
DStream.foreachRDD does not work after transformation
Guys!
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
kafkaStream.map(_._2).foreachRDD(rdd => rdd.foreach(println))
This worked; it printed the Kafka messages. But when I run it like…

user7135450 · 21 · 4
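A common gotcha behind "it printed before, now it doesn't": rdd.foreach(println) runs on the executors, so on a cluster the output lands in the executors' stdout logs rather than the driver console, and locally it only appears to work because driver and executors share one console. To inspect data on the driver:

    kafkaStream.map(_._2).print() // driver-side, first 10 records per batch

    kafkaStream.map(_._2).foreachRDD { rdd =>
      rdd.take(10).foreach(println) // take() ships the records to the driver
    }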