Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.
Questions tagged [spark-streaming]
5565 questions
2 votes, 1 answer
Store Algebird Bloom filter with Storehaus
I have a Spark job whose final output is an Algebird Bloom filter, and I need to reuse this Bloom filter in another Spark job.
Is there a way to store this Bloom filter in a KV store (e.g. Redis) using Twitter Storehaus and retrieve it in the…
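Lacking the rest of the question, one generic approach is to serialize the filter to a byte array and put those bytes under a single key; any KV client, a Storehaus-backed Redis store included, can then hold them. A minimal sketch using plain Java serialization (the helper names are made up, and Kryo/Chill would give smaller payloads):

    import java.io._

    // Turn any serializable object (e.g. an Algebird Bloom filter) into bytes
    // that a KV store such as Redis can hold under a key, and back again.
    def toBytes(obj: Serializable): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(obj)
      oos.close() // flushes the object stream before reading the buffer
      bos.toByteArray
    }

    def fromBytes[T](bytes: Array[Byte]): T =
      new ObjectInputStream(new ByteArrayInputStream(bytes))
        .readObject()
        .asInstanceOf[T]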

arnaud briche - 1,479 - 3 - 20 - 25
2 votes, 1 answer
java.io.NotSerializableException in Spark Streaming with checkpointing enabled
Code below:
def main(args: Array[String]) {
  val sc = new SparkContext
  val sec = Seconds(3)
  val ssc = new StreamingContext(sc, sec)
  ssc.checkpoint("./checkpoint")
  val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
  val…
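When checkpointing is enabled, Spark serializes the entire DStream graph, closures included, so a non-serializable reference anywhere in the graph triggers exactly this exception. The documented pattern is to build the context inside a factory passed to StreamingContext.getOrCreate; a sketch reusing the question's values (the app name is made up):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpoint-example") // assumed name
      val ssc = new StreamingContext(conf, Seconds(3))
      ssc.checkpoint("./checkpoint")
      // define all DStream transformations here, before returning the context
      ssc
    }

    // Recovers from the checkpoint if one exists, otherwise builds a fresh context.
    val ssc = StreamingContext.getOrCreate("./checkpoint", createContext _)
    ssc.start()
    ssc.awaitTermination()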

Guo - 1,761 - 2 - 22 - 45
2 votes, 1 answer
Why do Spark Streaming executors start at different times?
I'm using Spark Streaming 1.6 with Kafka as a source.
My input arguments are as follows:
num-executors 5
num-cores 4
batch interval 10 sec
maxRate 600
blockInterval 350 ms
Why do some of my executors start later than…
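For reference, those arguments correspond to the following Spark settings; a sketch of how they would be set programmatically (whether maxRate applies depends on receiver-based vs. direct Kafka input):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .set("spark.executor.instances", "5")            // num-executors 5
      .set("spark.executor.cores", "4")                // num-cores 4
      .set("spark.streaming.receiver.maxRate", "600")  // maxRate 600 records/sec
      .set("spark.streaming.blockInterval", "350ms")   // blockInterval 350 ms

    val ssc = new StreamingContext(conf, Seconds(10))  // batch interval 10 sec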

Vadym B. - 681 - 7 - 21
2 votes, 1 answer
Spark streaming print on received stream
What I am trying to achieve is basically to print "hello world" each time I receive a stream of data.
I know that on each stream I can call the function foreachRDD, but that does not help me because:
It might be that there is no data processed
I don't…
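One common way to get this behaviour is to guard foreachRDD with an emptiness check, so the action fires only for batches that actually carry data; a sketch, with `stream` standing in for whatever DStream the job receives:

    import org.apache.spark.streaming.dstream.DStream

    def printOnData(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        // Runs on the driver once per batch; isEmpty skips the empty batches.
        if (!rdd.isEmpty()) println("hello world")
      }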

Kevin Cohen - 1,211 - 2 - 15 - 22
2 votes, 1 answer
Does Apache Storm have machine learning libraries like Apache Spark does?
I am comparing Apache Storm and Apache Spark Streaming to choose a distributed realtime computation system. There are already lots of discussions comparing these two technologies, for instance…

Yassir S - 1,032 - 3 - 21 - 44
2 votes, 1 answer
error: value succinct is not a member of org.apache.spark.rdd.RDD[String]
I am trying out SuccinctRDD as a search mechanism.
Below is what I am trying, as per the docs:
import edu.berkeley.cs.succinct.kv._
val data = sc.textFile("file:///home/aman/data/jsonDoc1.txt")
val succintdata = data.succinct.persist()
The link…
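A hedged guess, worth checking against the Succinct version in use: the project's documented examples enrich RDD[Array[Byte]] via the top-level package import, whereas the kv import targets key-value RDDs, so something like the following may be what the docs intend:

    import edu.berkeley.cs.succinct._  // top-level import rather than the kv package

    val data = sc.textFile("file:///home/aman/data/jsonDoc1.txt")
    // SuccinctRDD is documented over byte arrays, hence the getBytes mapping.
    val succinctData = data.map(_.getBytes).succinct.persist()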

Amaresh - 3,231 - 7 - 37 - 60
2 votes, 0 answers
Is the operation inside foreachRDD supposed to be blocking?
In a Spark Streaming job, is the operation inside foreachRDD supposed to be synchronous/blocking?
What if you do some asynchronous operation which returns a Future? Are you then supposed to Await that Future?
Note: This question is specifically…
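Spark considers a batch's output operation finished when the foreachRDD body returns, so work launched asynchronously inside it is not tracked by the batch. A sketch of the awaiting variant the question asks about (writeAsync is a hypothetical async sink, not a library call):

    import scala.concurrent.duration._
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical asynchronous side effect, stands in for any Future-returning I/O.
    def writeAsync(records: Array[String]): Future[Unit] = Future { /* I/O here */ }

    def process(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        val fut = writeAsync(rdd.collect())
        Await.result(fut, 30.seconds) // block so the batch reflects the completed write
      }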

Mikael Ståldal - 374 - 1 - 3 - 11
2 votes, 1 answer
Using futures in Spark Streaming & Cassandra (Scala)
I am rather new to Spark, and I wonder what the best practice is when using Spark Streaming with Cassandra.
Usually, when performing IO, it is good practice to execute it inside a Future (in Scala).
However, a lot of the spark-cassandra-connector…
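For what it's worth, the connector's saveToCassandra is a blocking action that already runs in parallel across executors, so the common pattern is to call it directly per batch rather than wrapping it in a Future; a sketch (keyspace, table, and record type are made up):

    import com.datastax.spark.connector._  // enables saveToCassandra on RDDs
    import org.apache.spark.streaming.dstream.DStream

    case class Event(id: String, payload: String) // illustrative record type

    def save(stream: DStream[Event]): Unit =
      stream.foreachRDD(_.saveToCassandra("my_keyspace", "events"))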

EranM - 303 - 1 - 3 - 14
2 votes, 0 answers
How to test Spark Streaming code
I have a class that pulls in RDDs from a Flume stream.
I'd like to test it by having the test populate the stream.
I thought using the queueStream method on StreamingContext would work, but I'm running into problems:
I get NullPointerExceptions…
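queueStream does work for this kind of test, with the caveat that it does not support checkpointing; a minimal sketch of feeding batches by hand (names and data are made up):

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("queue-test")
    val ssc = new StreamingContext(conf, Seconds(1))

    val queue = mutable.Queue.empty[RDD[String]]
    val stream = ssc.queueStream(queue, oneAtATime = true)
    stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))

    ssc.start()
    queue += ssc.sparkContext.makeRDD(Seq("a", "b", "c")) // push a test batch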

s d - 2,666 - 4 - 26 - 42
2 votes, 0 answers
Better method of connecting Spark Streaming, sockets and RabbitMQ
To get around the trouble of consuming RabbitMQ messages directly in Spark Streaming, I decided to consume messages using Pika (a Python adapter) and send them over sockets, with the aim of getting Spark Streaming to consume the sent data via…
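The Spark end of that pipeline can be a plain socket text stream; a sketch, with host and port as placeholders and `ssc` an existing StreamingContext:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.StreamingContext

    def consumeSocketFeed(ssc: StreamingContext): Unit = {
      // Reads newline-delimited messages written to the socket by the Pika producer.
      val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
      lines.foreachRDD(rdd => println(s"received ${rdd.count()} messages"))
    }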

disruptive - 5,687 - 15 - 71 - 135
2 votes, 1 answer
What is a supported streaming data source to persist results?
I'm trying to use the new streamed-writing feature in Spark 2.0.1-SNAPSHOT. Which output data sources are actually supported to persist the results?
I was able to display the output on the console with something like this:
Dataset testData =…
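Around Spark 2.0, the console and memory sinks were for debugging and the file sink was the main persistent option; a sketch continuing from the question's testData (paths are placeholders):

    // Persist each micro-batch as Parquet files; a checkpointLocation is
    // required for a file sink.
    val query = testData.writeStream
      .format("parquet")
      .option("path", "/tmp/stream-out")
      .option("checkpointLocation", "/tmp/stream-checkpoint")
      .start()

    query.awaitTermination()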

Paolo - 21 - 1
2 votes, 0 answers
Spark gives a StackOverflowError when training using FPGrowth
I am using FPGrowth in Spark's MLlib to find frequent patterns.
Here is my code:
object FPGrowthExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FPGrowthExample")
    val sc = new SparkContext(conf)
    …
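A StackOverflowError in MLlib jobs is often a deep-recursion or long-lineage problem; enlarging the JVM thread stack (e.g. via driver/executor Java options) or setting a checkpoint directory are the usual mitigations. A sketch continuing the question's setup, with path and parameters as placeholders:

    import org.apache.spark.mllib.fpm.FPGrowth

    sc.setCheckpointDir("/tmp/fpgrowth-checkpoint") // lets Spark truncate long lineages

    val transactions = sc.textFile("/path/to/transactions")
      .map(_.trim.split(' '))

    val model = new FPGrowth()
      .setMinSupport(0.2)
      .setNumPartitions(10)
      .run(transactions)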

chenqun - 21 - 2
2 votes, 1 answer
Spark (streaming) RDD foreachPartitionAsync functionality/working
I will come to the actual question, but please bear with my use-case first. I have the following use-case; say I got rddStud from somewhere:
val rddStud: RDD[(String,Student)] = ???
where 'String' is some random string and 'Student' is a case class…
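For context on the API itself: foreachPartitionAsync submits the job without blocking the caller and returns a FutureAction whose completion you can hook; a sketch (Student here is a stand-in for the question's case class, and the body is illustrative):

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}
    import org.apache.spark.rdd.RDD

    case class Student(name: String) // stand-in for the question's case class

    def processAsync(rddStud: RDD[(String, Student)]): Unit = {
      val action = rddStud.foreachPartitionAsync { part =>
        part.foreach { case (key, student) => () } // per-record side effect goes here
      }
      // FutureAction is a scala.concurrent.Future, so completion hooks work as usual.
      action.onComplete {
        case Success(_) => println("all partitions processed")
        case Failure(e) => println(s"job failed: $e")
      }
    }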

K P - 861 - 1 - 8 - 25
2 votes, 1 answer
Saving protobuf in HBase/HDFS using Spark Streaming
I am looking to store protobuf messages in HBase/HDFS using Spark Streaming, and I have the following two questions:
What is an efficient way of storing a huge number of protobuf messages, and an efficient way of retrieving them to do some analytics? For…
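On the HDFS side, one option is a SequenceFile of raw bytes, since protobuf messages serialize to byte arrays via toByteArray; a sketch (the key scheme and path are assumptions):

    import org.apache.spark.rdd.RDD

    // Store each message's serialized bytes keyed by a message id; Spark's
    // implicits convert (String, Array[Byte]) pairs to Writables.
    def saveBatch(batch: RDD[(String, Array[Byte])], path: String): Unit =
      batch.saveAsSequenceFile(path)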

Lokesh Kumar P - 369 - 5 - 20
2 votes, 1 answer
Spark Streaming from Kafka: one task lags behind, causing the whole batch to slow down
I have a Spark Streaming application that reads data from Kafka over the network. It is important to note that the cluster and the Kafka servers are in different geographies.
The average time to complete a job is around 8-10 minutes (I am running 10…
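When a single Kafka partition carries far more data than the rest, one common mitigation is capping per-partition ingest so one task cannot dominate the batch; a sketch (the value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Direct-stream setting: max records per second read from each Kafka partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")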

Sohaib - 4,556 - 8 - 40 - 68