Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Since version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.
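To make the tag description concrete, here is a minimal sketch of the classic Spark Streaming word count. It assumes the `spark-streaming` artifact is on the classpath and a text source on `localhost:9999` (e.g. `nc -lk 9999`); the application name and port are illustrative assumptions, not anything mandated by the API.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Arriving data is batched into RDDs every 5 seconds.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: a text stream on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```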
Questions tagged [spark-streaming]
5565 questions
2
votes
1 answer
[Spark Streaming] How to load the model every time a new message comes in?
In Spark Streaming, every time a new message is received, a model is used to predict something based on this new message. But as time goes by, the model can change for some reason, so I want to reload the model whenever a new message comes in.…

Zefu Hu
- 33
- 4
2
votes
3 answers
Can't access kafka.serializer.StringDecoder
I have added the sbt packages for Kafka and Spark Streaming as follows:
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1",
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1"
However, when I want to use the Kafka direct stream… I can't…

mahdi62
- 959
- 2
- 11
- 17
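For reference, a `build.sbt` fragment of the following shape typically makes `kafka.serializer.StringDecoder` resolvable, since `spark-streaming-kafka` transitively pulls in the Kafka 0.8 client classes where `StringDecoder` lives. This is a sketch assuming the Scala 2.10 / Spark 1.6.1 versions given in the question.

```scala
// build.sbt sketch (versions assumed from the question).
// spark-streaming-kafka_2.10 transitively brings in the Kafka 0.8 client,
// which contains kafka.serializer.StringDecoder.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
)
```

Using `%%` lets sbt append the Scala binary version (`_2.10`) automatically, which avoids mixing Scala versions across the two artifacts.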
2
votes
1 answer
Not all Spark Workers are starting: SPARK_WORKER_INSTANCES
I have my spark-defaults.conf configured like this.
My node has 32 GB RAM and 8 cores.
I am planning to use 16 GB and 4 workers, with each using 1…

AKC
- 953
- 4
- 17
- 46
2
votes
0 answers
Spark Streaming maxRate is sometimes violated
I have a simple Spark Streaming process (1.6.1) which receives data from Azure Event Hub. I am experimenting with back pressure and maxRate settings. This is my configuration:
spark.streaming.backpressure.enabled =…

tmp123
- 23
- 4
2
votes
0 answers
Repartitioning a MapWithStateDStream
I am using a mapWithState function on a file stream and then performing some actions on the result. When I trace my application, I see that there are just 2 partitions after my mapWithState function for the mapped1 MapWithStateDStream. I wanted to know if I can…

mahdi62
- 959
- 2
- 11
- 17
2
votes
2 answers
Convert Hive SQL to Spark SQL
I want to convert my Hive SQL to Spark SQL to test query performance. Here is my Hive SQL. Can anyone suggest how to convert the Hive SQL to Spark SQL?
SELECT split(DTD.TRAN_RMKS,'/')[0] AS TRAB_RMK1,
split(DTD.TRAN_RMKS,'/')[1] AS…

Sree Eedupuganti
- 440
- 5
- 15
2
votes
1 answer
Twitter data from spark
I am learning Twitter integration with Spark Streaming.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import…

subho
- 491
- 1
- 4
- 13
2
votes
2 answers
Spark streaming with kafka - restarting from checkpoint
We are building a fault-tolerant system using Spark Streaming and Kafka, and are testing Spark Streaming checkpointing to give us the option of restarting the Spark job if it crashes for any reason. Here's what our Spark process looks like:
Spark…

Shay
- 505
- 1
- 3
- 19
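The standard restart-from-checkpoint pattern is `StreamingContext.getOrCreate`: all DStream setup must happen inside the factory function, so that a fresh start and a recovery build the same graph. A sketch, with the checkpoint path, app name, and batch interval as illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint location.
val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("KafkaStreaming")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... create the Kafka direct stream and wire up transformations here ...
  ssc
}

// On a clean start this calls createContext(); after a crash it rebuilds
// the context (including direct-stream offsets) from the checkpoint data.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

A common pitfall with this pattern is creating the stream outside the factory: the recovered context then has no DStream graph to restore and the job fails on restart.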
2
votes
1 answer
Specifying a timeout with mapWithState in Spark Streaming
I am following a sample of the mapWithState function on the Databricks website.
The code for trackStateFunc is as follows:
def trackStateFunc(batchTime: Time, key: String, value: Option[Int], state: State[Long]): Option[(String, Long)] = {
val sum =…

mahdi62
- 959
- 2
- 11
- 17
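For context, the timeout in `mapWithState` is configured on the `StateSpec`, not inside the tracking function itself; the function only observes `isTimingOut()` on its final invocation for an expiring key. A sketch based on the running-sum example the question refers to (the 60-second timeout is an assumed value):

```scala
import org.apache.spark.streaming.{Seconds, State, StateSpec, Time}

// Running sum per key, with a per-key idle timeout.
def trackStateFunc(batchTime: Time, key: String, value: Option[Int],
                   state: State[Long]): Option[(String, Long)] = {
  if (state.isTimingOut()) {
    // Key has been idle past the timeout; its state is about to be removed.
    // Calling state.update() here would throw.
    None
  } else {
    val sum = value.getOrElse(0).toLong + state.getOption().getOrElse(0L)
    state.update(sum)
    Some((key, sum))
  }
}

// The timeout is attached to the StateSpec:
val spec = StateSpec.function(trackStateFunc _).timeout(Seconds(60))
// stream.mapWithState(spec)  // where `stream` is a DStream[(String, Int)]
```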
2
votes
0 answers
Stateful stream processing with Spark DataFrames
Is it possible to achieve stateful stream processing with the Spark DataFrame API? The first thing I'd like to try is deduplicating the stream. DStream has a mapWithState method, but in order to convert it to DataFrames, I have to use foreachRDD:
dStream…

lizarisk
- 7,562
- 10
- 46
- 70
2
votes
0 answers
How to enable dynamic repartitioning in Spark Streaming for uneven data load
I have a use case where the input stream data is skewed; the volume of data can range from 0 to 50,000 events per batch. Each data entry is independent of the others. Therefore, to avoid the shuffle caused by repartitioning, I want to use some kind of dynamic…

Alchemist
- 849
- 2
- 10
- 27
2
votes
1 answer
Debug, Warn & Info messages from non-main class not visible in spark executor logging
We've tried a variety of solutions, including changing the log4j.properties file, copying the file to the executors via --file and then telling them to use it as an argument passed to Spark via --conf, and also tried updating the configuration of the EMR…

null
- 3,469
- 7
- 41
- 90
2
votes
1 answer
Spark Streaming: How to load a Pipeline on a Stream?
I am implementing a lambda architecture system for stream processing.
I have no issue creating a Pipeline with GridSearch in Spark Batch:
pipeline = Pipeline(stages=[data1_indexer, data2_indexer, ..., assembler, logistic_regressor])
paramGrid =…

Manuel G
- 1,523
- 1
- 21
- 34
2
votes
2 answers
Submitting Spark Job On Scheduler Pool
I am running a Spark Streaming job in cluster mode. I have created a pool with 200 GB of memory (CDH). I want to run my Spark Streaming job on that pool, so I tried setting
sc.setLocalProperty("spark.scheduler.pool", "pool")
in code, but it's not…

Justin
- 735
- 1
- 15
- 32
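One point worth separating here: a CDH resource pool with a memory cap is a YARN queue, which is distinct from Spark's internal fair-scheduler pools that `setLocalProperty` targets. A sketch of how each is typically selected (the queue and pool names below are assumptions taken from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// YARN queue (what a CDH resource pool with a memory limit usually is):
// selected at submit time, e.g. `spark-submit --queue pool ...`,
// or in the configuration before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("StreamingJob")
  .set("spark.yarn.queue", "pool")       // assumed queue name
  .set("spark.scheduler.mode", "FAIR")   // required for Spark-internal pools

val sc = new SparkContext(conf)

// Spark-internal fair-scheduler pool: only affects scheduling *within*
// this application, and applies to jobs submitted from the current thread.
sc.setLocalProperty("spark.scheduler.pool", "pool")
```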
2
votes
0 answers
Failure to reload from checkpoint directory
When I tried reloading my spark streaming application from a checkpoint directory, I got the following exception:
java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist:…

mahdi62
- 959
- 2
- 11
- 17