Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.
Questions tagged [spark-streaming]
5565 questions
2
votes
1 answer
What happens when I run out of memory to maintain the state with mapWithState
I have a very large number of keys and limited cluster size.
I am using mapWithState to update my states. As new data comes in, the number of keys increases. When I look at the Storage tab of the Spark UI, MapWithStateRDD is always stored in…

Rishi
- 148
- 1
- 7
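For the mapWithState question above, the usual way to bound state memory is to give the StateSpec a timeout so idle keys are evicted from MapWithStateRDD. The sketch below is an assumption-heavy illustration: the count-per-key state, the 30-minute timeout, and the partition count are not from the question.

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Hypothetical mapping function keeping a running count per key.
def mappingFunc(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  if (state.isTimingOut()) {
    (key, state.get())                                 // key is being evicted; emit its final count
  } else {
    val newCount = state.getOption().getOrElse(0L) + value.getOrElse(0)
    state.update(newCount)
    (key, newCount)
  }
}

val spec = StateSpec.function(mappingFunc _)
  .timeout(Minutes(30))      // drop keys that receive no updates for 30 minutes
  .numPartitions(200)        // spread the state RDD across the cluster

// keyedStream: DStream[(String, Int)] assumed; checkpointing must be enabled on the context.
// val stateStream = keyedStream.mapWithState(spec)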
2
votes
1 answer
Spark aggregateByKey on Dataset
Here's an example of aggregateByKey on mutable.HashSet[String] written by @bbejeck
val initialSet = mutable.HashSet.empty[String]
val addToSet = (s: mutable.HashSet[String], v: String) => s += v
val mergePartitionSets = (p1: mutable.HashSet[String],…

faustineinsun
- 451
- 1
- 6
- 16
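The aggregateByKey operator quoted above exists on pair RDDs, not on Datasets; a hedged sketch of the closest Dataset-API equivalent is groupByKey plus mapGroups. The Record case class and the tiny in-memory Dataset below are assumptions for illustration only.

import org.apache.spark.sql.SparkSession

case class Record(key: String, value: String)

val spark = SparkSession.builder().master("local[*]").appName("AggExample").getOrCreate()
import spark.implicits._

val ds = Seq(Record("a", "x"), Record("a", "y"), Record("b", "z")).toDS()

val uniqueByKey = ds.groupByKey(_.key)
  .mapGroups((k, rows) => (k, rows.map(_.value).toSeq.distinct))   // one collection per key, like the HashSet aggregation

uniqueByKey.show(false)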
2
votes
0 answers
How do we process/scale variable size batches in Apache Spark Streaming
I am running a Spark Streaming process that receives a batch of data every n seconds. I am using repartition to scale the application. Since the repartition size is fixed, we get lots of small files when the batch size is very small. Is…

Alchemist
- 849
- 2
- 10
- 27
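One hedged way to handle variable batch sizes is to derive the partition count from each batch's own record count instead of a fixed number; recordsPerPartition and the output path below are assumptions, not anything stated in the question.

val recordsPerPartition = 100000L   // tuning assumption, adjust to the workload

dstream.foreachRDD { (rdd, time) =>
  val count = rdd.count()
  if (count > 0) {
    val parts = math.max(1, (count / recordsPerPartition).toInt)
    rdd.repartition(parts)
      .saveAsTextFile(s"/output/batch-${time.milliseconds}")   // hypothetical path; small batches now produce few files
  }
}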
2
votes
1 answer
Spark Streaming input rate drop
Running a Spark Streaming job, I have encountered the following behavior more than once. Processing starts well: the processing time for each batch is well below the batch interval. Then suddenly, the input rate drops to near zero. See these…

Socci
- 337
- 2
- 12
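When diagnosing a rate collapse like this, the settings below are commonly checked. This is a hedged sketch; whether they apply depends on the receiver or Kafka setup, which the question does not show.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("StreamingJob")
  .set("spark.streaming.backpressure.enabled", "true")        // let Spark throttle ingestion instead of stalling
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")  // cap for direct Kafka streams
  .set("spark.streaming.receiver.maxRate", "10000")           // cap for receiver-based streams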
2
votes
0 answers
SparkStreaming: How to get list like collect()
I am a beginner with Spark Streaming.
I want to load HBase records in a Spark Streaming app.
So I wrote the code below in Python.
My "load_records" function fetches HBase records and returns them.
Spark Streaming cannot use collect().…

penlight
- 617
- 10
- 26
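The question above is in Python, but the pattern is the same in either API: collect() is not available on the DStream itself, only on each batch's RDD inside foreachRDD. A hedged Scala sketch of that pattern:

dstream.foreachRDD { rdd =>
  val records = rdd.collect()   // materializes this batch as a local collection on the driver
  // records can now be handed to a driver-side function, e.g. the question's load_records
}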
2
votes
1 answer
Excluding hadoop dependency from spark library in sbt file
I am working on Spark 1.3.0. My build.sbt looks as follows:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
"org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
"org.apache.spark" %%…

Alok
- 1,374
- 3
- 18
- 44
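A hedged build.sbt sketch of one way to exclude the Hadoop artifacts pulled in by Spark and pin a Hadoop version explicitly; the Hadoop version shown is an assumption, not taken from the question.

libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.3.0" % "provided")
    .exclude("org.apache.hadoop", "hadoop-client"),
  "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"   // assumed cluster Hadoop version
)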
2
votes
1 answer
Best solution to accumulate Spark Streaming DStream
I'm looking for the best solution to accumulate the last N number of messages in a Spark DStream. I'd also like to specify the number of messages to retain.
For example, given the following stream, I'd like to retain the last 3 elements:
Iteration …

user278530
- 83
- 2
- 11
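A hedged sketch of one way to retain only the most recent N messages: fold each batch into a single-key state with updateStateByKey. N, the placeholder key, and the String payload are assumptions, and checkpointing must be enabled for stateful operations.

val N = 3

// dstream: DStream[String] assumed; ssc.checkpoint(...) must be set
val lastN = dstream
  .map(msg => ("all", msg))
  .updateStateByKey[Vector[String]] { (batch: Seq[String], state: Option[Vector[String]]) =>
    Some((state.getOrElse(Vector.empty[String]) ++ batch).takeRight(N))
  }

lastN.map(_._2).print()   // prints the retained last-N elements each batch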
2
votes
1 answer
Using Kafka to communicate between long running Spark jobs
I am new to Apache Spark and have a need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm…

smeeb
- 27,777
- 57
- 250
- 447
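A hedged sketch of the publishing side of that design: one job writes its results to a topic that another long-running job consumes, creating the producer per partition on the executors. The broker address, topic name, and resultStream are assumptions.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

resultStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)   // built on the executor, never shipped from the driver
    records.foreach(msg => producer.send(new ProducerRecord[String, String]("job-a-results", msg)))
    producer.close()
  }
}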
2
votes
2 answers
Counting records of my RDDs in a large Dstream
I am trying to work with a large RDD as read by a file DStream.
The code looks as follows:
val creatingFunc = { () =>
val conf = new SparkConf()
.setMaster("local[10]")
.setAppName("FileStreaming")
…

Mahdi
- 787
- 1
- 8
- 33
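For per-batch counting, a hedged sketch: count() on a DStream produces a new DStream of per-batch counts, while rdd.count() inside foreachRDD returns a plain Long on the driver. fileStream below stands in for the question's file DStream.

fileStream.count().print()   // a stream of one Long per batch interval

fileStream.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}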
2
votes
2 answers
Using Spark StreamingContext to Consume from Kafka topic
I am brand new to Spark & Kafka and am trying to get some Scala code (running as a Spark job) to act as a long-running process (not just a short-lived/scheduled task) and to continuously poll a Kafka broker for messages. When it receives messages, I…

smeeb
- 27,777
- 57
- 250
- 447
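A hedged sketch of a long-running consumer using the direct Kafka API from spark-streaming-kafka for Spark 1.x; the broker address, topic name, and batch interval are assumptions.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaPoller")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // assumed broker address
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))                              // assumed topic name

messages.map { case (_, value) => value }.foreachRDD { rdd =>
  // handle each polled batch of message values here
}

ssc.start()
ssc.awaitTermination()   // keeps the job alive as a long-running process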
2
votes
2 answers
Unable to serialize SparkContext in foreachRDD
I am trying to save streaming data from Kafka to Cassandra. I am able to read and parse the data, but when I call the lines below to save it, I get a Task not serializable exception. My class extends Serializable, but I am not sure why I…

Suresh
- 38,717
- 16
- 62
- 66
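A hedged sketch of the usual cause and fix: everything referenced inside the inner closures of foreachRDD is serialized and shipped to the executors, so the StreamingContext/SparkContext and any non-serializable client must not be captured there; building the client per partition avoids it. CassandraWriter below is a hypothetical stand-in for whatever connector or client the question actually uses.

// Hypothetical client type; real code might use the spark-cassandra-connector instead.
class CassandraWriter(host: String) extends AutoCloseable {
  def save(row: String): Unit = ()        // placeholder
  override def close(): Unit = ()
}

parsedStream.foreachRDD { rdd =>
  // Driver-side code may use rdd.sparkContext here, but it must not be referenced inside the closures below.
  rdd.foreachPartition { rows =>
    val writer = new CassandraWriter("cassandra-host")   // created on the executor, per partition
    rows.foreach(writer.save)
    writer.close()
  }
}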
2
votes
0 answers
How to stabilize spark streaming application with a handful of super big sessions?
I am running a Spark Streaming application based on the mapWithState DStream function. The application transforms input records into sessions based on a session ID field inside the records.
A session is simply all of the records with the same ID.…

ZianyD
- 171
- 2
- 12
2
votes
1 answer
Spark 2.0.0 twitter streaming driver is no longer available
During the migration from Spark 1.6.2 to Spark 2.0.0 it turned out that the package org.apache.spark.streaming.twitter has been removed and Twitter streaming is no longer available, nor is the dependency
org.apache.spark
…

Ivan Shulak
- 100
- 6
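For reference, the Twitter DStream moved out of the main Spark distribution in 2.0; a commonly used replacement is the connector published under Apache Bahir. The version below is an assumption and should be matched to the Spark version in use.

libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0"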
2
votes
0 answers
Streaming pdf files using spark streaming filestream
I am building an application that scans PDF files and extracts data from them.
I have already built an application that does batch processing using Spark Core, but now I want the data to be continuously streamed from the directory.
How can I use spark…

fady zohdy
- 45
- 1
- 8
2
votes
1 answer
Spark History Logs Are Not Enabled with Oozie Spark Action in Cloudera
I am trying to follow these instructions to enable history logs with the Spark Oozie action.
https://archive.cloudera.com/cdh5/cdh/5/oozie/DG_SparkActionExtension.html
To ensure that your Spark job shows up in the Spark History Server, make sure to…

Alchemist
- 849
- 2
- 10
- 27
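The linked Cloudera page comes down to passing the event-log settings through the action's spark-opts element; a hedged sketch is shown below, where the host names and the HDFS path are placeholders, not values from the question.

<spark-opts>
  --conf spark.eventLog.enabled=true
  --conf spark.eventLog.dir=hdfs://namenode:8020/user/spark/applicationHistory
  --conf spark.yarn.historyServer.address=http://history-server-host:18088
</spark-opts>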