Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.

5565 questions
2
votes
1 answer

Store an Algebird Bloom filter with Storehaus

I have a Spark job whose final output is an Algebird Bloom filter, and I need to reuse this Bloom filter in another Spark job. Is there a way to store this Bloom filter in a KV store (e.g. Redis) using Twitter Storehaus and retrieve it in the…
arnaud briche
  • 1,479
  • 3
  • 20
  • 25
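A minimal sketch of one way to do this, assuming an Algebird version where the filter type is plain BF and that it is Java-serializable (it normally must be, since Spark ships it between stages): round-trip the filter through a byte array, which any byte-oriented KV store, Storehaus-backed Redis included, can hold.

    import java.io._
    import com.twitter.algebird.BF

    // Sketch: turn an Algebird Bloom filter into bytes and back so it can be
    // written to and read from any byte-oriented KV store.
    def serialize(bf: BF): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(bf)
      oos.close()
      bos.toByteArray
    }

    def deserialize(bytes: Array[Byte]): BF = {
      val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
      try ois.readObject().asInstanceOf[BF] finally ois.close()
    }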
2
votes
1 answer

java.io.NotSerializableException in Spark Streaming with checkpointing enabled

code below: def main(args: Array[String]) { val sc = new SparkContext val sec = Seconds(3) val ssc = new StreamingContext(sc, sec) ssc.checkpoint("./checkpoint") val rdd = ssc.sparkContext.parallelize(Seq("a","b","c")) val…
Guo
  • 1,761
  • 2
  • 22
  • 45
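The usual fix for this class of error (a sketch, not necessarily the asker's exact bug): with checkpointing on, Spark serializes the whole DStream graph, so the graph should be built inside a factory function passed to StreamingContext.getOrCreate rather than capturing outer, non-serializable state.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("CheckpointExample")
      val ssc = new StreamingContext(conf, Seconds(3))
      ssc.checkpoint("./checkpoint")
      // define all DStream transformations here, inside the factory
      ssc
    }

    // Recovers from the checkpoint if one exists, otherwise calls the factory.
    val ssc = StreamingContext.getOrCreate("./checkpoint", createContext _)
    ssc.start()
    ssc.awaitTermination()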
2
votes
1 answer

Why do Spark Streaming executors start at different times?

I'm using Spark Streaming 1.6, which uses Kafka as a source. My input arguments are as follows: num-executors 5, num-cores 4, batch interval 10 sec, maxRate 600, blockInterval 350 ms. Why do some of my executors start later than…
Vadym B.
  • 681
  • 7
  • 21
2
votes
1 answer

Spark Streaming: print on received stream

What I am trying to achieve is basically to print "hello world" each time I receive a stream of data. I know that on each stream I can call the function foreachRDD, but that does not help me because: it might be that there is no data processed; I don't…
Kevin Cohen
  • 1,211
  • 2
  • 15
  • 22
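A minimal sketch of the usual answer: foreachRDD fires once per batch interval even when the batch is empty, so guard with isEmpty if you only want output when data actually arrived.

    import org.apache.spark.streaming.dstream.DStream

    def printPerBatch(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        // runs on the driver once per batch; skip batches with no data
        if (!rdd.isEmpty()) println("hello world")
      }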
2
votes
1 answer

Does Apache Storm have machine learning libraries like Apache Spark does?

I am comparing Apache Storm and Apache Spark Streaming to choose a distributed realtime computation system. There are already lots of discussions comparing these two technologies, for instance…
Yassir S
  • 1,032
  • 3
  • 21
  • 44
2
votes
1 answer

error: value succinct is not a member of org.apache.spark.rdd.RDD[String]

I am trying out SuccinctRDD as a search mechanism. Below is what I am trying, as per the docs: import edu.berkeley.cs.succinct.kv._ val data = sc.textFile("file:///home/aman/data/jsonDoc1.txt") val succintdata = data.succinct.persist() The link…
Amaresh
  • 3,231
  • 7
  • 37
  • 60
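A plausible fix, going by the Succinct docs: the .succinct enrichment is defined on RDD[Array[Byte]] (via edu.berkeley.cs.succinct._), not on RDD[String], so the lines need converting to bytes first.

    import edu.berkeley.cs.succinct._ // enriches RDD[Array[Byte]] with .succinct

    val data = sc.textFile("file:///home/aman/data/jsonDoc1.txt")
    // convert each line to bytes so the implicit conversion applies
    val succinctData = data.map(_.getBytes).succinct.persist()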
2
votes
0 answers

Is the operation inside foreachRDD supposed to be blocking?

In a Spark Streaming job, is the operation inside foreachRDD supposed to be synchronous / blocking? What if you do some asynchronous operation that returns a Future? Are you then supposed to Await on that Future? Note: This question is specifically…
Mikael Ståldal
  • 374
  • 1
  • 3
  • 11
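One defensible pattern, sketched under the assumption that the asynchronous work must finish before the batch is considered done (asyncWrite below is a hypothetical sink, not a real API): block on the Future inside foreachRDD so the batch is not reported complete before the side effect lands.

    import scala.concurrent.duration._
    import scala.concurrent.{Await, Future}
    import org.apache.spark.streaming.dstream.DStream

    def writeEachBatch[T](stream: DStream[T], asyncWrite: Seq[T] => Future[Unit]): Unit =
      stream.foreachRDD { rdd =>
        val done = asyncWrite(rdd.collect().toSeq) // hypothetical async sink
        Await.result(done, 30.seconds)             // keep the batch effectively blocking
      }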
2
votes
1 answer

Using Futures with Spark Streaming & Cassandra (Scala)

I am rather new to Spark, and I wonder what the best practice is when using Spark Streaming with Cassandra. Usually, when performing IO, it is good practice to execute it inside a Future (in Scala). However, a lot of the spark-cassandra-connector…
EranM
  • 303
  • 1
  • 3
  • 14
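A sketch assuming the DataStax spark-cassandra-connector (keyspace and table names below are placeholders): saveToCassandra blocks, but it already parallelizes the writes across executors, so the Scala reflex of wrapping IO in a Future buys little here.

    import com.datastax.spark.connector._
    import org.apache.spark.streaming.dstream.DStream

    case class Event(id: String, payload: String)

    def save(events: DStream[Event]): Unit =
      events.foreachRDD { rdd =>
        // blocking per batch, but the work is distributed across the cluster
        rdd.saveToCassandra("my_keyspace", "events")
      }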
2
votes
0 answers

How to test Spark Streaming code

I have a class that pulls in RDDs from a Flume stream. I'd like to test it by having the test populate the stream. I thought using the queueStream method on StreamingContext would work, but I'm running into problems: I get NullPointerExceptions…
s d
  • 2,666
  • 4
  • 26
  • 42
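A minimal working sketch of queueStream-based testing; one frequent cause of NullPointerExceptions here is touching the queue or the stream before the StreamingContext is fully constructed.

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("queueStreamTest")
    val ssc = new StreamingContext(conf, Seconds(1))

    val queue = mutable.Queue[RDD[String]]()
    val stream = ssc.queueStream(queue)
    stream.foreachRDD(rdd => println(s"batch size = ${rdd.count()}"))

    queue += ssc.sparkContext.makeRDD(Seq("a", "b", "c")) // enqueue test data
    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()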
2
votes
0 answers

Better method of connecting Spark Streaming, sockets and RabbitMQ

To get around the trouble of consuming RabbitMQ messages directly in Spark Streaming, I decided to consume messages using pika (the Python adapter) and send them over sockets, with the aim of getting Spark Streaming to consume the sent data via…
disruptive
  • 5,687
  • 15
  • 71
  • 135
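The Spark side of such a bridge is just the built-in socket source; a sketch, with host and port assumed (the pika-based producer would write one message per line to this socket). A custom Receiver, or one of the third-party RabbitMQ receivers, would avoid the extra hop.

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    def rabbitViaSocket(ssc: StreamingContext): DStream[String] =
      // assumes newline-delimited messages arriving on localhost:9999
      ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)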
2
votes
1 answer

What is a supported streaming data source to persist results?

I'm trying to use the new streamed writing feature with Spark 2.0.1-SNAPSHOT. Which output data sources are actually supported to persist the results? I was able to display the output on the console with something like this: Dataset testData =…
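For Spark 2.0.x, the documented sinks besides console are the file (parquet) sink, memory, and foreach. A Scala sketch of the parquet one (paths are placeholders, and the question's testData is assumed to be a streaming Dataset):

    // the file sink requires a checkpoint location and Append output mode
    val query = testData.writeStream
      .format("parquet")
      .option("path", "/tmp/out")
      .option("checkpointLocation", "/tmp/ckpt")
      .start()
    query.awaitTermination()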
2
votes
0 answers

Spark gives a StackOverflowError when training using FPGrowth

I am using FPGrowth in Spark's MLlib to find frequent patterns. Here is my code: object FPGrowthExample{ def main(args:Array[String]){ val conf = new SparkConf().setAppName("FPGrowthExample") val sc = new SparkContext(conf) …
chenqun
  • 21
  • 2
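A sketch of a common mitigation, on the assumption that the overflow comes from deep recursion while mining a large FP-tree: raise the JVM thread stack size, and avoid starting with a very low minSupport.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.fpm.FPGrowth

    object FPGrowthExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("FPGrowthExample")
          .set("spark.executor.extraJavaOptions", "-Xss16m") // bigger thread stacks
        val sc = new SparkContext(conf)

        val transactions = sc.textFile("data/transactions.txt") // placeholder path
          .map(_.trim.split(' '))
        val model = new FPGrowth()
          .setMinSupport(0.1) // start high, lower gradually
          .setNumPartitions(10)
          .run(transactions)
        model.freqItemsets.take(10).foreach(println)
      }
    }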
2
votes
1 answer

Spark (streaming) RDD foreachPartitionAsync functionality/working

I will come to the actual question, but please bear with my use case first. I have the following use case; say I got rddStud from somewhere: val rddStud: RDD[(String,Student)] = ??? where 'String' is some random string and 'Student' is a case class…
K P
  • 861
  • 1
  • 8
  • 25
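How it behaves, in a sketch: foreachPartitionAsync submits the job and returns a FutureAction immediately, so the driver thread is not blocked, while the per-partition work still runs on the executors.

    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.rdd.RDD

    case class Student(name: String)

    def saveAsync(rddStud: RDD[(String, Student)]): Unit = {
      val done = rddStud.foreachPartitionAsync { iter =>
        iter.foreach { case (_, student) =>
          // write each record to an external store here
        }
      }
      // FutureAction is a scala.concurrent.Future; observe completion here
      done.onComplete(result => println(s"partition writes finished: $result"))
    }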
2
votes
1 answer

Saving protobuf in HBase/HDFS using Spark Streaming

I am looking to store protobuf messages in HBase/HDFS using Spark Streaming, and I have the following two questions: What is an efficient way of storing a huge number of protobuf messages, and an efficient way of retrieving them to do some analytics? For…
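One common approach for the HDFS half, sketched (not the only option): keep the raw protobuf bytes in SequenceFiles keyed by a message id; analytics jobs re-read them and parse with the generated protobuf classes.

    import org.apache.hadoop.io.BytesWritable
    import org.apache.spark.rdd.RDD

    def saveBatch(batch: RDD[(String, Array[Byte])], path: String): Unit =
      batch
        .map { case (id, bytes) => (id, new BytesWritable(bytes)) }
        .saveAsSequenceFile(path) // compact, splittable, HDFS-friendly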
2
votes
1 answer

Spark Streaming from Kafka: one task lags behind, causing the whole batch to slow down

I have a Spark Streaming application that reads data from Kafka over the network. It is important to note that the cluster and the Kafka servers are in different geographies. The average time to complete a job is around 8-10 minutes (I am running 10…
Sohaib
  • 4,556
  • 8
  • 40
  • 68