
I have a Spark Streaming application that reads data from Kafka over the network. It is important to note that the cluster and the Kafka servers are in different geographic regions.

The average time to complete a job is around 8-10 minutes (I am running 10-minute batch intervals). However, in certain batches the job completion time shoots up by an essentially random amount (it could be 20 minutes, 50 minutes, or an hour). Upon digging I found that all tasks complete on time except one, and that single task drags out the whole turnaround time. For example, here is the task time log from one such instance:

[Tasks: screenshot of per-task completion times]

In this case task 6 took 54 minutes while the others finished very quickly, even though the input split is the same. I attribute this to network issues (a slow or clogged link) and believe that restarting that task could have saved a lot of time.

Does Spark offer any control through which we can restart slow tasks on a different node and then use the result of whichever copy completes first? Or is there a better solution to this problem that I am unaware of?
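For reference, a minimal sketch of the kind of setup described above, assuming Scala and the Kafka 0.8 direct-stream API; the broker addresses, topic name, and per-event processing are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaEtl {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaEtl")
    // 10-minute batch interval, as described above
    val ssc = new StreamingContext(conf, Minutes(10))

    // Placeholder brokers/topic; the Kafka cluster sits in a remote geography
    val kafkaParams = Map("metadata.broker.list" -> "remote-broker1:9092,remote-broker2:9092")
    val topics = Set("events")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Basic ETL on each batch: one Spark task per Kafka partition pulls its
    // offset range over the WAN, so one slow fetch stalls the whole batch
    stream.map(_._2).foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        // stand-in for the real per-partition transform and write-out
        events.foreach(event => println(event))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```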

Sohaib
  • If your app has multiple jobs inside it, setting this configuration might help: spark.streaming.concurrentJobs – bistaumanga Jun 22 '16 at 14:09
  • @bistaumanga What do you mean by multiple jobs? It is a streaming application. It runs a couple of map-reduce stages in each iteration. – Sohaib Jun 23 '16 at 05:31
  • I mean doing multiple things in the same application, like counting the number of events, counting by some group, computing the average of some key... having multiple actions – bistaumanga Jun 23 '16 at 07:44
  • I do not think the computation phase takes much time or that parallelising it would be beneficial. It's basic ETL on around 3 million events per 10 minutes, divided into 10 partitions running in parallel. – Sohaib Jun 23 '16 at 08:02
  • Then it must be something else, maybe in your network or.... If you solve this problem, do post the solution. It might be helpful for someone... – bistaumanga Jun 23 '16 at 08:15
  • We're running into the same issue, did you ever find a solution? It looks like the streaming application couldn't connect to Kafka because we see `java.nio.channels.ClosedChannelException` in the logs whenever a single 10-minute batch takes much longer than the median. – jackar Aug 23 '16 at 16:35

1 Answer


I would definitely have a look at the spark.speculation.* configuration parameters and make them a lot more aggressive. In your case, for example, I think these values would be appropriate:

  • spark.speculation = true
  • spark.speculation.interval = 1min (How often Spark will check for tasks to speculate.)
  • spark.speculation.multiplier = 1.1 (How many times slower a task is than the median to be considered for speculation.)
  • spark.speculation.quantile = 0.5 (Fraction of tasks which must be complete before speculation is enabled for a particular stage.)

You can find the full list of configuration parameters here.
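For example, here is a minimal sketch of applying those settings on the SparkConf before the StreamingContext is created; they can equally be passed as --conf flags to spark-submit. The app name and batch interval are just placeholders matching the question:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// Speculation settings from the list above. Spark launches speculative copies
// of straggler tasks on other executors and uses whichever copy finishes first.
val conf = new SparkConf()
  .setAppName("KafkaEtl")                      // placeholder app name
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.interval", "1min")   // check for stragglers every minute
  .set("spark.speculation.multiplier", "1.1")  // 1.1x the median task time counts as slow
  .set("spark.speculation.quantile", "0.5")    // wait for half the tasks before speculating

val ssc = new StreamingContext(conf, Minutes(10))
```

Note that with a direct Kafka stream, a speculative copy re-reads the same offset range from Kafka, so the trade-off is a duplicate fetch over the slow link instead of waiting out the straggler.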

BenFradet
  • Thanks, this definitely looks interesting. I will read up on this and get back if it works for me! – Sohaib Jun 24 '16 at 06:22