Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.

5565 questions
2
votes
1 answer

What happens when I run out of memory to maintain the state with mapWithState

I have a very large number of keys and a limited cluster size. I am using mapWithState to update my states. As new data comes in, the number of keys increases. When I went to the Storage tab of the Spark UI, MapWithStateRDD is always stored in…
Rishi
  • 148
  • 1
  • 7
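For reference, MapWithStateRDDs are cached in memory by default, which is why unbounded key growth eventually shows up as memory pressure. A minimal sketch of the usual mitigation, a StateSpec timeout that evicts idle keys (the update function and durations below are hypothetical):

```scala
import org.apache.spark.streaming.{Seconds, State, StateSpec}

// Hypothetical update function: keeps a running count per key.
def updateCount(key: String, value: Option[Int], state: State[Long]): Option[(String, Long)] = {
  if (state.isTimingOut()) {
    // Key was idle past the timeout; Spark evicts it after this call,
    // so do not call state.update() here.
    None
  } else {
    val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
    state.update(newCount)
    Some((key, newCount))
  }
}

val spec = StateSpec
  .function(updateCount _)
  .timeout(Seconds(3600)) // evict keys idle for an hour, capping state growth

// keyedStream: DStream[(String, Int)] built elsewhere
// val stateful = keyedStream.mapWithState(spec)
```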
2
votes
1 answer

Spark aggregateByKey on Dataset

Here's an example of aggregateByKey on mutable.HashSet[String] written by @bbejeck val initialSet = mutable.HashSet.empty[String] val addToSet = (s: mutable.HashSet[String], v: String) => s += v val mergePartitionSets = (p1: mutable.HashSet[String],…
faustineinsun
  • 451
  • 1
  • 6
  • 16
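The excerpt's example completes to roughly the following runnable sketch on an RDD (it assumes an existing SparkContext `sc`):

```scala
import scala.collection.mutable

// Collect the set of values seen per key, merging partial sets across partitions.
val initialSet = mutable.HashSet.empty[String]
val addToSet = (s: mutable.HashSet[String], v: String) => s += v
val mergePartitionSets =
  (p1: mutable.HashSet[String], p2: mutable.HashSet[String]) => p1 ++= p2

val pairs = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("b", "z")))
val setsByKey = pairs.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
// setsByKey: RDD[(String, mutable.HashSet[String])]
```

On a Dataset, the closest typed equivalent is ds.groupByKey(...).mapGroups(...) or a custom org.apache.spark.sql.expressions.Aggregator rather than aggregateByKey, which lives on pair RDDs.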
2
votes
0 answers

How do we process/scale variable size batches in Apache Spark Streaming

I am running a Spark Streaming process where I get a batch of data every n seconds. I am using repartition to scale the application. Since the repartition size is fixed, we get lots of small files when the batch size is very small. Is…
Alchemist
  • 849
  • 2
  • 10
  • 27
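One hedged approach to the small-files problem is to size the repartition to each batch instead of using a fixed count; `recordsPerPartition` and the output path below are assumptions:

```scala
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Repartition each batch proportionally to its record count, so tiny batches
// do not fan out into many near-empty files.
def writeRightSized(dstream: DStream[String], recordsPerPartition: Long = 100000L): Unit =
  dstream.foreachRDD { (rdd, time: Time) =>
    val count = rdd.count() // costs one extra pass over the batch
    val numPartitions = math.max(1L, count / recordsPerPartition).toInt
    rdd.repartition(numPartitions)
       .saveAsTextFile(s"/tmp/out/batch-${time.milliseconds}") // hypothetical path
  }
```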
2
votes
1 answer

Spark Streaming input rate drop

Running a Spark Streaming job, I have encountered the following behavior more than once. Processing starts well: the processing time for each batch is well below the batch interval. Then suddenly, the input rate drops to near zero. See these…
Socci
  • 337
  • 2
  • 12
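If the drop turns out to be rate-controller related, one thing worth checking is Spark's backpressure settings (available since 1.5), which let the ingestion rate adapt to processing delays instead of collapsing; the ceiling value in this sketch is an assumption:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("StreamingApp")
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional hard ceiling per receiver, in records/sec:
  .set("spark.streaming.receiver.maxRate", "10000")
```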
2
votes
0 answers

SparkStreaming: How to get list like collect()

I am a beginner with Spark Streaming. I want to load HBase records in a Spark Streaming app, so I wrote the code below in Python. My "load_records" function fetches HBase records and returns them. Spark Streaming cannot use collect().…
penlight
  • 617
  • 10
  • 26
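The usual workaround: a DStream has no collect(), but each batch's underlying RDD does, reached through foreachRDD. A Scala sketch of the shape (the same pattern applies in PySpark via dstream.foreachRDD(lambda rdd: ...)):

```scala
import org.apache.spark.streaming.dstream.DStream

// Collect each batch back to the driver as a plain array.
// Assumes per-batch volume is small enough to fit in driver memory.
def printEachBatch(dstream: DStream[String]): Unit =
  dstream.foreachRDD { rdd =>
    val records: Array[String] = rdd.collect() // list-like result, one batch at a time
    records.foreach(println)
  }
```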
2
votes
1 answer

Excluding hadoop dependency from spark library in sbt file

I am working on Spark 1.3.0. My build.sbt looks as follows: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.3.0" % "provided", "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided", "org.apache.spark" %%…
Alok
  • 1,374
  • 3
  • 18
  • 44
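A common build.sbt shape for this: exclude the transitive hadoop-client from the Spark artifacts and pin your own (the Hadoop version below is hypothetical):

```scala
libraryDependencies ++= Seq(
  // Strip the Hadoop client Spark would pull in transitively...
  ("org.apache.spark" %% "spark-core" % "1.3.0" % "provided")
    .exclude("org.apache.hadoop", "hadoop-client"),
  // ...and declare the version you actually want on the classpath.
  "org.apache.hadoop" % "hadoop-client" % "2.6.0"
)
```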
2
votes
1 answer

Best solution to accumulate Spark Streaming DStream

I'm looking for the best solution to accumulate the last N messages in a Spark DStream. I'd also like to specify the number of messages to retain. For example, given the following stream, I'd like to retain the last 3 elements: Iteration …
user278530
  • 83
  • 2
  • 11
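Since windows in Spark Streaming are time-based rather than count-based, one hedged option is a bounded buffer maintained on the driver; this sketch assumes per-batch volume is small enough to collect:

```scala
import scala.collection.mutable
import org.apache.spark.streaming.dstream.DStream

// Keep a rolling "last N" on the driver. The queue lives only in the
// driver JVM; rdd.collect() brings each batch back before it is appended.
def retainLastN(dstream: DStream[String], n: Int): Unit = {
  val buffer = mutable.Queue.empty[String]
  dstream.foreachRDD { rdd =>
    rdd.collect().foreach { msg =>
      buffer.enqueue(msg)
      while (buffer.size > n) buffer.dequeue() // evict the oldest beyond N
    }
    println(s"last $n: ${buffer.mkString(", ")}")
  }
}
```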
2
votes
1 answer

Using Kafka to communicate between long running Spark jobs

I am new to Apache Spark and have a need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm…
smeeb
  • 27,777
  • 57
  • 250
  • 447
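A sketch of the pattern being considered: one job publishes its results to a Kafka topic that the other job consumes. The broker address and topic name are assumptions; the producer is built per partition so it is created on the executors, not captured from the driver:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Publish one partition's worth of results to a shared topic.
def publish(messages: Iterator[String]): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092") // hypothetical broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  try messages.foreach(m => producer.send(new ProducerRecord("jobA-results", m)))
  finally producer.close()
}

// In the producing job:
// resultStream.foreachRDD(_.foreachPartition(publish))
```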
2
votes
2 answers

Counting records of my RDDs in a large Dstream

I am trying to work with a large RDD as read by a file DStream. The code looks as follows: val creatingFunc = { () => val conf = new SparkConf() .setMaster("local[10]") .setAppName("FileStreaming") …
Mahdi
  • 787
  • 1
  • 8
  • 33
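Two common shapes for counting a DStream's records, sketched below; which one fits depends on whether the count is needed as a stream or per batch on the driver:

```scala
import org.apache.spark.streaming.dstream.DStream

def logCounts(dstream: DStream[String]): Unit = {
  // 1) As a transformed stream of per-batch counts:
  dstream.count().print()

  // 2) Or count each batch's RDD directly, e.g. to skip empty batches:
  dstream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) println(s"batch size = ${rdd.count()}")
  }
}
```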
2
votes
2 answers

Using Spark StreamingContext to Consume from Kafka topic

I am brand new to Spark & Kafka and am trying to get some Scala code (running as a Spark job) to act as a long-running process (not just a short-lived/scheduled task) and to continuously poll a Kafka broker for messages. When it receives messages, I…
smeeb
  • 27,777
  • 57
  • 250
  • 447
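A minimal long-running consumer sketch using the direct-stream API from that era's spark-streaming-kafka artifact (Kafka 0.8 integration); the broker address and topic name are assumptions:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaPoller {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaPoller")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.map(_._2).foreachRDD(rdd => rdd.foreach(println)) // handle messages here

    ssc.start()
    ssc.awaitTermination() // blocks: the job keeps polling until stopped
  }
}
```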
2
votes
2 answers

Unable to serialize SparkContext in foreachRDD

I am trying to save streaming data from Kafka to Cassandra. I am able to read and parse the data, but when I call the lines below to save it, I get a Task not serializable exception. My class extends Serializable, but I am not sure why I…
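The usual cause: the closure passed to foreachRDD (or a map inside it) captures the enclosing class, which holds the non-serializable SparkContext. A sketch of the conventional fix, with the Cassandra write left as a placeholder:

```scala
import org.apache.spark.streaming.dstream.DStream

// Never reference the SparkContext, or anything that holds it (such as the
// enclosing class), from code that ships to executors.
def save(parsed: DStream[(String, String)]): Unit =
  parsed.foreachRDD { rdd =>
    // Runs on the driver: RDD-level calls are fine here.
    rdd.foreachPartition { records =>
      // Runs on executors: build non-serializable resources (sessions,
      // connections) HERE instead of capturing them from the outer scope.
      records.foreach { case (k, v) => /* session.execute(...) placeholder */ }
    }
  }
```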
2
votes
0 answers

How to stabilize spark streaming application with a handful of super big sessions?

I am running a Spark Streaming application based on the mapWithState DStream function. The application transforms input records into sessions based on a session ID field inside the records. A session is simply all of the records with the same ID.…
ZianyD
  • 171
  • 2
  • 12
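One hedged mitigation for a handful of oversized sessions is to cap what each key's state may hold inside the mapping function itself; `maxEvents` and the Session shape below are assumptions (no state timeout is configured in this sketch):

```scala
import org.apache.spark.streaming.State

case class Session(id: String, events: Vector[String])

// Bound each session to its most recent maxEvents records, so one giant
// session cannot dominate executor memory.
def trackSession(maxEvents: Int)(id: String, record: Option[String],
                                 state: State[Session]): Option[Session] = {
  val current = state.getOption.getOrElse(Session(id, Vector.empty))
  val updated = current.copy(events = (current.events ++ record).takeRight(maxEvents))
  state.update(updated)
  Some(updated)
}

// Plugged in via: StateSpec.function(trackSession(10000) _)
```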
2
votes
1 answer

Spark 2.0.0 twitter streaming driver is no longer available

During migration from Spark 1.6.2 to Spark 2.0.0, it turned out that the package org.apache.spark.streaming.twitter has been removed and Twitter streaming is no longer available, as is the dependency org.apache.spark
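For reference, the Twitter connector was moved out of the main Spark distribution in the 2.x line and now lives in Apache Bahir; a build.sbt sketch (match the Bahir version to your Spark release):

```scala
// Bahir hosts the connectors dropped from Spark 2.x, including Twitter.
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0"
```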
2
votes
0 answers

Streaming pdf files using spark streaming filestream

I am building an application that scans PDF files and extracts data from them. I have already built an application that does batch processing using Spark Core, but now I want the data to be continuously streamed from the directory. How can I use Spark…
fady zohdy
  • 45
  • 1
  • 8
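textFileStream is line-oriented, so binary PDFs need the lower-level fileStream with a binary-capable input format. A sketch; WholePdfInputFormat is hypothetical and would have to be a custom FileInputFormat that emits each file as a single record:

```scala
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.spark.streaming.StreamingContext

// Hypothetical: WholePdfInputFormat extends FileInputFormat and yields one
// (offset, raw bytes) pair per PDF dropped into the watched directory.
// val ssc: StreamingContext = ...
// val pdfs = ssc.fileStream[LongWritable, BytesWritable, WholePdfInputFormat](
//   "hdfs:///incoming/pdfs")
// pdfs.map { case (_, bytes) => bytes.getBytes } // hand raw bytes to your extractor
```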
2
votes
1 answer

Spark History Logs Are Not Enabled with Oozie Spark Action in Cloudera

I am trying to follow these instructions to enable history logs with the Spark Oozie action: https://archive.cloudera.com/cdh5/cdh/5/oozie/DG_SparkActionExtension.html To ensure that your Spark job shows up in the Spark History Server, make sure to…
Alchemist
  • 849
  • 2
  • 10
  • 27
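Per the linked Cloudera page, the history-server properties are passed through the action's spark-opts element; a sketch of the workflow fragment, with host names and the HDFS path as placeholders:

```xml
<spark-opts>
  --conf spark.eventLog.enabled=true
  --conf spark.eventLog.dir=hdfs://namenode:8020/user/spark/applicationHistory
  --conf spark.yarn.historyServer.address=http://historyserver.example.com:18088
</spark-opts>
```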