Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.
Questions tagged [spark-streaming]
5565 questions
2
votes
1 answer
How to load JSON (paths saved in CSV) with Spark?
I am new to Spark.
I can load a single .json file in Spark. What if there are thousands of .json files in a folder?
And I have a CSV file, which classifies the .json files with labels.
What should I…

Fengyu
- 35
- 2
- 6
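A minimal sketch of one way to approach the question above, assuming Spark 1.4+ and a hypothetical labels.csv whose lines are "path,label" pairs (the file names and schema are assumptions, not from the question):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import scala.Tuple2;
import java.util.List;

public class LoadLabeledJson {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LoadLabeledJson");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Hypothetical CSV layout: each line is "path,label".
        JavaPairRDD<String, String> pathAndLabel = sc.textFile("labels.csv")
                .mapToPair(line -> {
                    String[] parts = line.split(",");
                    return new Tuple2<>(parts[0], parts[1]);
                });

        // Collect the (path, label) pairs on the driver, then load each
        // JSON file into a DataFrame.
        List<Tuple2<String, String>> entries = pathAndLabel.collect();
        for (Tuple2<String, String> entry : entries) {
            DataFrame df = sqlContext.read().json(entry._1());
            // ... tag df with the label entry._2() and continue processing
        }

        sc.stop();
    }
}

If all the files share a schema, sqlContext.read().json("folder/*.json") reads the whole folder in one pass and is usually cheaper than a per-file loop.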
2
votes
1 answer
How to figure out if DStream is empty
I have 2 inputs, where the first input is a stream (say input1) and the second one is a batch (say input2).
I want to figure out whether the keys in the first input match a single row or more than one row in the second input.
The further transformations/logic…

Dazzler
- 807
- 9
- 11
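A sketch of one common approach to the question above, assuming both inputs can be keyed the same way (the paths, port, and comma-separated format are hypothetical): skip empty micro-batches with rdd.isEmpty(), then join against the batch RDD and count matches per key.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class MatchStreamAgainstBatch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("MatchStreamAgainstBatch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // input2: the static batch input, keyed (hypothetical path/format).
        JavaPairRDD<String, String> input2 = jssc.sparkContext()
                .textFile("hdfs:///input2")
                .mapToPair(line -> {
                    String[] parts = line.split(",");
                    return new Tuple2<>(parts[0], parts[1]);
                });

        // input1: the streaming input, keyed the same way (hypothetical source).
        JavaPairDStream<String, String> input1 = jssc
                .textFileStream("hdfs:///input1")
                .mapToPair(line -> {
                    String[] parts = line.split(",");
                    return new Tuple2<>(parts[0], parts[1]);
                });

        input1.foreachRDD(rdd -> {
            if (rdd.isEmpty()) {   // JavaRDD.isEmpty() exists since Spark 1.3
                return;            // nothing arrived in this micro-batch
            }
            // Count how many batch rows each streaming key matches.
            rdd.join(input2)
               .countByKey()
               .forEach((key, count) ->
                       System.out.println(key + " matched " + count + " row(s)"));
        });

        jssc.start();
        jssc.awaitTermination();
    }
}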
2
votes
1 answer
How to map key/value partitions in parallel in Spark Streaming
I have a Spark Streaming program running in local mode in which I receive JSON messages from a TCP socket connection, several per batch interval.
Each of these messages has an ID, which I use to create a key/value JavaPairDStream, such that in each…

manuel mourato
- 801
- 1
- 12
- 36
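A hedged sketch of the keying-plus-partitioning pattern the question above describes; extractId and process are hypothetical stand-ins for the JSON parsing and per-message work. Hash-partitioning by ID keeps messages with the same ID in one partition, and the partitions are processed in parallel across the executor cores.

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.List;

public class KeyedPartitions {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("KeyedPartitions");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaDStream<String> messages = jssc.socketTextStream("localhost", 9999);

        // Key each JSON message by its ID (extractId is a hypothetical parser).
        JavaPairDStream<String, String> byId = messages
                .mapToPair(json -> new Tuple2<>(extractId(json), json));

        JavaPairDStream<String, String> processed = byId
                .transformToPair(rdd -> rdd.partitionBy(new HashPartitioner(4)))
                .mapPartitionsToPair(it -> {
                    List<Tuple2<String, String>> out = new ArrayList<>();
                    while (it.hasNext()) {
                        Tuple2<String, String> kv = it.next();
                        out.add(new Tuple2<>(kv._1(), process(kv._2())));
                    }
                    return out;  // Spark 1.x signature; in 2.x return out.iterator()
                });

        processed.print();
        jssc.start();
        jssc.awaitTermination();
    }

    private static String extractId(String json) { /* hypothetical */ return json; }
    private static String process(String json)   { /* hypothetical */ return json; }
}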
2
votes
2 answers
Multiple consumers exactly-once processing with Apache Spark Streaming
I am looking to process elements on a queue (Kafka or Amazon Kinesis) and to have multiple operations performed on each element, for example:
Write it to an HDFS cluster
Invoke a REST API
Trigger a notification on Slack.
On each of these…

Edmondo
- 19,559
- 13
- 62
- 115
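A sketch of the usual structure for fanning one stream out to several sinks, with hypothetical postToApi and notifySlack helpers. One caveat relevant to the question above: Spark Streaming's exactly-once guarantee covers transformations, while output operations are at-least-once, so external sinks need idempotent or transactional writes.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MultiSink {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("MultiSink");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Hypothetical source; a Kafka direct stream would slot in the same way.
        JavaDStream<String> elements = jssc.socketTextStream("localhost", 9999);

        elements.foreachRDD(rdd -> {
            rdd.cache();  // one read of the batch feeds all three operations

            // 1. HDFS: one directory per batch (for replay-idempotent paths,
            //    prefer the foreachRDD overload that also passes the batch Time).
            rdd.saveAsTextFile("hdfs:///out/batch-" + System.currentTimeMillis());

            // 2 + 3. REST and Slack, per partition so connections are opened
            //    on the executors. postToApi/notifySlack are hypothetical and
            //    should be idempotent: side effects run at-least-once.
            rdd.foreachPartition(iter -> iter.forEachRemaining(e -> {
                postToApi(e);
                notifySlack(e);
            }));

            rdd.unpersist();
        });

        jssc.start();
        jssc.awaitTermination();
    }

    private static void postToApi(String element)   { /* hypothetical */ }
    private static void notifySlack(String element) { /* hypothetical */ }
}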
2
votes
1 answer
Execute multiple actions parallel/async in Spark Streaming
Is there a way to execute multiple actions async/parallel in Spark Streaming?
Here is my code:
positions.foreachRDD(rdd -> {
JavaRDD pbv = rdd.map(p -> A.create(p));
javaFunctions(pbv).writerBuilder("poc",…

mananana
- 393
- 3
- 15
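One possible pattern for the question above (not the asker's code; the Cassandra writer is replaced here with plain saveAsTextFile/count actions): cache the RDD, enable the FAIR scheduler, and submit the actions from separate driver threads so Spark can run the resulting jobs concurrently instead of one after another.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelActions {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("ParallelActions")
                // Let concurrently submitted jobs share cluster resources.
                .set("spark.scheduler.mode", "FAIR");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        ExecutorService pool = Executors.newFixedThreadPool(2);

        JavaDStream<String> positions = jssc.socketTextStream("localhost", 9999);

        positions.foreachRDD(rdd -> {
            JavaRDD<String> pbv = rdd.cache();  // shared by both actions

            // Submit the two actions from separate driver threads so the
            // two Spark jobs run concurrently.
            Future<?> writeJob = pool.submit(() ->
                    pbv.saveAsTextFile("hdfs:///out/a-" + System.nanoTime()));
            Future<?> countJob = pool.submit(() ->
                    System.out.println("count = " + pbv.count()));

            writeJob.get();  // block until both jobs finish before the
            countJob.get();  // next batch's foreachRDD fires
            pbv.unpersist();
        });

        jssc.start();
        jssc.awaitTermination();
    }
}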
2
votes
0 answers
spark streaming application - deployment best practices
I am using spark-submit cluster-mode deployment for my application to run it in production.
But this requires the jars to be at the same path on all the nodes, and likewise the config file that is passed as an argument.
I…

Knight71
- 2,927
- 5
- 37
- 63
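One commonly used alternative, sketched under the assumption of a YARN cluster: let spark-submit ship the dependencies and the config itself via --jars and --files, so only the submitting node needs local copies (the paths and class name below are hypothetical).

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.StreamingApp \
  --jars /local/deps/dep1.jar,/local/deps/dep2.jar \
  --files /local/conf/app.conf \
  hdfs:///apps/streaming-app.jar \
  app.conf

Files passed with --files appear in each executor's working directory under their base name, so the application can open app.conf by name; the application jar itself can live on HDFS in cluster mode.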
2
votes
2 answers
How to add a jar using HiveContext in the Spark job
I am trying to add the JSONSerDe jar file in order to load JSON data into a Hive table from the Spark job. My code is shown below:
SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase");
JavaSparkContext…

Bhaskar
- 271
- 7
- 20
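A minimal sketch of registering a SerDe through HiveContext for a case like the one above; the jar path, table definition, and SerDe class (the openx JSON SerDe) are assumptions for illustration, not the asker's setup.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class AddSerDeJar {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Register the SerDe with Hive's session classpath (the jar path is
        // hypothetical); alternatively ship it cluster-wide with
        // spark-submit --jars.
        hiveContext.sql("ADD JAR /path/to/json-serde.jar");

        hiveContext.sql("CREATE TABLE IF NOT EXISTS events (payload STRING) "
                + "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'");
    }
}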
2
votes
1 answer
DStream Runtime Creation/Destruction
Can DStreams with new names be created and older DStreams destroyed at runtime?
// Read the DStream
inputDstream = ssc.textFileStream("./myPath/")
Example:
I am reading a file called cvd_filter.txt in which every single line contains a string…

vkb
- 458
- 1
- 7
- 18
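DStreams cannot be created or destroyed once the StreamingContext has started, so a common workaround for situations like the one above is to keep a single DStream and vary its behavior per batch. A sketch, assuming cvd_filter.txt holds one wanted string per line: transform() runs its body on the driver for every batch, so the filter set can be reloaded there.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class DynamicFilter {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("DynamicFilter");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        JavaDStream<String> inputDstream = jssc.textFileStream("./myPath/");

        // transform() is evaluated on the driver once per batch, so the
        // current contents of cvd_filter.txt take effect on the next batch.
        JavaDStream<String> filtered = inputDstream.transform(rdd -> {
            Set<String> wanted = new HashSet<>(
                    Files.readAllLines(Paths.get("cvd_filter.txt")));
            return rdd.filter(wanted::contains);
        });

        filtered.print();
        jssc.start();
        jssc.awaitTermination();
    }
}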
2
votes
0 answers
Could not read until the end sequence number of the range
I have a Spark Streaming application which reads data from Kinesis, processes it, and sends the result to Elasticsearch. It was working fine, but it suddenly started throwing the following error while reading data from…

SMN
- 46
- 5
2
votes
0 answers
Spark Streaming variable in UpdateStateByKey not changing value after restarting application from checkpoint
I'm currently working in Python, building a moderately complex application that relies on stateful data from multiple sources. With PySpark I've run into an issue where a global variable used within an updateStateByKey function isn't being assigned…

JoeP
- 21
- 3
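For context on the behavior described above, a sketch of the restart pattern in the Java API (the question is PySpark; the checkpoint path here is hypothetical): everything captured by the updateStateByKey closure is deserialized from the checkpoint on restart, so the factory function runs only when no checkpoint exists, and values that must stay current after a restart have to flow through the stream or the state itself rather than through globals.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointRestart {
    private static final String CHECKPOINT_DIR = "hdfs:///checkpoints/app";

    public static void main(String[] args) throws Exception {
        // On restart, the whole context (including closures and anything
        // they captured) is restored from CHECKPOINT_DIR; createContext()
        // is only invoked when no checkpoint exists yet.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(
                CHECKPOINT_DIR, CheckpointRestart::createContext);
        jssc.start();
        jssc.awaitTermination();
    }

    private static JavaStreamingContext createContext() {
        SparkConf conf = new SparkConf().setAppName("CheckpointRestart");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint(CHECKPOINT_DIR);
        // ... define the stream and the updateStateByKey pipeline here
        return jssc;
    }
}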
2
votes
1 answer
Spark Streaming -> DStream.checkpoint versus SparkStreaming.checkpoint
I have a Spark 1.4 Streaming application, which reads data from Kafka, uses a stateful transformation, and has a batch interval of 15 seconds.
In order to use stateful transformations, as well as recover from driver failures, I need to set checkpointing…

Srdjan Nikitovic
- 853
- 2
- 9
- 19
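A sketch of how the two calls from the question above combine, using the Guava Optional that the Spark 1.x Java API expects (the word-count state and checkpoint path are hypothetical): StreamingContext.checkpoint sets where checkpoint data goes and is mandatory for stateful transformations and driver recovery; DStream.checkpoint only tunes how often that particular stream's RDDs are written.

import com.google.common.base.Optional;  // the Spark 1.x Java API uses Guava's Optional
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import java.util.List;

public class CheckpointIntervals {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("CheckpointIntervals");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(15));

        // WHERE checkpoint data is written; required for stateful
        // transformations and for recovering from driver failures.
        jssc.checkpoint("hdfs:///checkpoints/app");

        JavaPairDStream<String, Integer> counts = jssc
                .socketTextStream("localhost", 9999)
                .mapToPair(word -> new Tuple2<>(word, 1))
                .updateStateByKey((List<Integer> values, Optional<Integer> state) -> {
                    int sum = state.or(0);
                    for (int v : values) sum += v;
                    return Optional.of(sum);
                });

        // HOW OFTEN this stream's RDDs are checkpointed; 5-10x the batch
        // interval is the commonly cited guideline, hence 75 s for a 15 s batch.
        counts.checkpoint(Durations.seconds(75));

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}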
2
votes
1 answer
Caching DStream in Spark Streaming
I have a Spark Streaming process which reads data from Kafka
into a DStream.
In my pipeline I do this twice (one after the other):
DStream.foreachRDD( transformations on RDD and inserting into destination).
(each time I do different processing and…

Srdjan Nikitovic
- 853
- 2
- 9
- 19
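A sketch of the caching pattern at issue above (a socket source stands in for Kafka): persist the DStream once so the second foreachRDD reuses the already-computed batch instead of recomputing it from the source. Spark Streaming unpersists old batch RDDs on its own, since spark.streaming.unpersist defaults to true.

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CachedDStream {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("CachedDStream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        JavaDStream<String> fromKafka = jssc.socketTextStream("localhost", 9999);

        // Persist the DStream once; without this, each of the two
        // foreachRDD actions below recomputes the batch.
        fromKafka.persist(StorageLevel.MEMORY_ONLY());

        fromKafka.foreachRDD(rdd -> {
            // first set of transformations + insert into destination A
        });
        fromKafka.foreachRDD(rdd -> {
            // second, different processing + insert into destination B
        });

        jssc.start();
        jssc.awaitTermination();
    }
}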
2
votes
0 answers
Spark akka throws a java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc
I have built Spark using Scala 2.11. I ran the following steps:
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
After building Spark successfully, I tried to initialize Spark via the Akka model.
So,…

Raveesh Sharma
- 1,486
- 5
- 21
- 38
2
votes
0 answers
Spark streaming reliable receiver and BlockGenerator?
As I understand it, when implementing a reliable receiver for Spark Streaming, block generation needs to be taken care of in the custom receiver. Is this as easy as collecting some events into some kind of queue and then storing the iterator? Or…

Sunny
- 605
- 10
- 35
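A hedged sketch of a reliable receiver that does its own block building, as the question above suggests; pollSource and ackSource are hypothetical client calls. The key point is that store(Iterator) blocks until Spark has stored the whole block, after which the source can safely be acknowledged.

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: a reliable receiver buffers records into blocks itself
// (no automatic BlockGenerator) and acks the source only after store()
// returns, i.e. after Spark has durably stored the block.
public class ReliableQueueReceiver extends Receiver<String> {
    private static final int BLOCK_SIZE = 100;

    public ReliableQueueReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        new Thread(this::receive, "queue-receiver").start();
    }

    @Override
    public void onStop() { /* the receive loop checks isStopped() */ }

    private void receive() {
        List<String> block = new ArrayList<>(BLOCK_SIZE);
        while (!isStopped()) {
            String record = pollSource();   // hypothetical source client
            if (record != null) {
                block.add(record);
            }
            if (block.size() >= BLOCK_SIZE) {
                // store(Iterator) blocks until the block is stored, so it
                // is safe to acknowledge the source afterwards.
                store(block.iterator());
                ackSource(block);           // hypothetical acknowledgement
                block = new ArrayList<>(BLOCK_SIZE);
            }
        }
    }

    private String pollSource()            { /* hypothetical */ return null; }
    private void ackSource(List<String> b) { /* hypothetical */ }
}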
2
votes
0 answers
Spark Streaming Task Distribution
I have a Spark Streaming program that uses updateStateByKey.
When I run it on a cluster with 3 machines, all updateStateByKey tasks (which are heavy tasks) run on one machine. This results in a scheduling delay on inputs, while the other machines have…

Majid Hajibaba
- 3,105
- 6
- 23
- 55
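A sketch of the usual first mitigation for the skew described above, assuming a word-count-style pipeline: give updateStateByKey an explicit partition count (12 here, an assumed 3 machines x 4 cores) so the state RDD has enough partitions to spread across the cluster.

import com.google.common.base.Optional;  // the Spark 1.x Java API uses Guava's Optional
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import java.util.List;

public class SpreadState {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("SpreadState");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint("hdfs:///checkpoints/spread-state");

        JavaPairDStream<String, Integer> pairs = jssc
                .socketTextStream("localhost", 9999)
                .mapToPair(word -> new Tuple2<>(word, 1));

        // The numPartitions argument (12, an assumed sizing) controls how
        // many state partitions updateStateByKey shuffles into.
        JavaPairDStream<String, Integer> state = pairs.updateStateByKey(
                (List<Integer> values, Optional<Integer> prev) -> {
                    int sum = prev.or(0);
                    for (int v : values) sum += v;
                    return Optional.of(sum);
                }, 12);

        state.print();
        jssc.start();
        jssc.awaitTermination();
    }
}

If the tasks still land on one machine, cached state partitions plus data locality can pin them there; lowering spark.locality.wait is a commonly suggested knob to let the scheduler place tasks on other nodes sooner.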