
I have a two-node standalone cluster for Spark stream processing. Below is sample code that demonstrates the process I am executing.

sparkConf.setMaster("spark://rsplws224:7077")
val ssc = new StreamingContext(sparkConf, Milliseconds(500)) // batch of 500 ms as I would like to have 1 sec latency
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream(...)            // custom receiver
val filteredDStream = inDStream.filter(...)        // filtering unwanted tuples
val keyDStream = filteredDStream.map(...)          // converting to a pair DStream
val stateStream = keyDStream.updateStateByKey(...) // updating state for history

stateStream.checkpoint(Milliseconds(2500)) // to remove long lineage and materialize the state stream
stateStream.count()

val withHistory = keyDStream.join(stateStream) // joining state with the input stream for further processing
val alertStream = withHistory.filter(...)      // decision taken by comparing history state with the current tuple data
alertStream.foreach(...)                       // notification to other systems

My problem is that Spark is not distributing this state RDD to multiple nodes and is not distributing tasks to the other node, which causes high response latency. My input load is around 100,000 tuples per second.

I have tried the things below, but nothing is working (see the settings sketch after this list):

1) Set spark.locality.wait to 1 second.

2) Reduced the memory allocated to the executor process to check whether Spark would distribute the RDD or tasks, but it does not, even when usage goes beyond the memory limit of the first node (m1), where the driver is also running.

3) Increased spark.streaming.concurrentJobs from 1 (the default) to 3.

4) Checked the Storage tab of the streaming UI: there are around 20 partitions for the state DStream RDD, all located on the local node m1.
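
In SparkConf terms, items 1) to 3) correspond to something like the following (a sketch, assuming a 1.x-era Spark where spark.locality.wait is given in milliseconds; the executor memory value is only an illustration of "reduced"):

sparkConf.set("spark.locality.wait", "1000")         // 1) locality wait of 1 second, in milliseconds
sparkConf.set("spark.executor.memory", "512m")       // 2) reduced executor memory; 512m is an illustrative value
sparkConf.set("spark.streaming.concurrentJobs", "3") // 3) concurrent jobs raised from the default of 1 to 3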

If I run SparkPi 100000, Spark is able to utilize the other node after a few seconds (30-40), so I am sure my cluster configuration is fine.

Edit

One thing I have noticed: even if I set the storage level MEMORY_AND_DISK_SER_2 for my RDD, the app UI Storage tab still shows it as "Memory Serialized 1x Replicated".
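
For reference, requesting that storage level on the state DStream looks roughly like this (a sketch; the call could equally be made on another DStream in the pipeline):

import org.apache.spark.storage.StorageLevel
stateStream.persist(StorageLevel.MEMORY_AND_DISK_SER_2) // asks for serialized, 2x-replicated storage, yet the UI reports 1x replication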

Jigar Parekh

3 Answers


Spark will not distribute stream data across the cluster automatically, because it tends to make full use of data locality (launching a task where its data already lies is usually better; this is the default configuration). But you can use repartition to distribute the stream data and improve parallelism. See http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
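
For example, a minimal sketch against the code in the question (the partition count of 8 is an arbitrary value that should be tuned to the cluster; the filter/map arguments remain elided as in the question):

val repartitionedDStream = inDStream.repartition(8)    // spread the received blocks across the cluster before the stateful operations
val filteredDStream = repartitionedDStream.filter(...) // rest of the pipeline unchanged
val keyDStream = filteredDStream.map(...)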

Xingjun Wang

If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI rather than the master URI.
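
For example, the two cases look like this (a sketch; the app name is illustrative):

import org.apache.spark.SparkConf
val localConf = new SparkConf().setMaster("local[*]").setAppName("stream-app")                 // runs everything in a single local JVM
val clusterConf = new SparkConf().setMaster("spark://rsplws224:7077").setAppName("stream-app") // submits tasks to the standalone master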

samthebest
  • Thx Sam, I have re-verified the Spark conf; it is not using the local URI. Any other suggestions? – Jigar Parekh Jun 30 '14 at 14:30
  • Please update the answer to include the line of code that sets it. Does it work when you use the spark-shell??? If it does, then startup the spark shell and run `sc.master` to see what it is. – samthebest Jun 30 '14 at 15:51
  • Sorry for the late reply; with spark-shell I got the same result. – Jigar Parekh Jul 04 '14 at 10:17
  • @JigarParekh Update your question with the line of code that sets the master, stick in a `println("sc.master = " + sc.master)` in your App. Have you checked in the UI to see if the workers are defo up?? – samthebest Jul 04 '14 at 11:04
  • Updated the code with the line that sets the master and verified it with a print from the streaming context. I have also verified the other workers are up and running; they also show up in the app UI Executors section. – Jigar Parekh Jul 05 '14 at 06:22
  • When you submit the job - does it defo show up in the UI under "Running applications"? If regular jobs show up and distribute, but when you try to use Spark Streaming with the same master URI it doesn't distribute, then I'm afraid I can't help you - that's an odd problem. – samthebest Jul 05 '14 at 10:25

By default the value of the spark.default.parallelism property depends on the deployment mode (in local mode it is the number of cores on the local machine), so all the tasks end up executed on the node that is receiving the data. Change this property in the spark-defaults.conf file in order to increase the level of parallelism.
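
For example (the value 8 is an arbitrary illustration; pick something proportional to the total number of cores in the cluster):

// in conf/spark-defaults.conf on the machine you submit from:
//   spark.default.parallelism   8
// or programmatically, before the StreamingContext is created:
sparkConf.set("spark.default.parallelism", "8")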