
I submitted a Spark Streaming job that calculates site page UV (unique visitors) with the mapWithState function. From the very beginning, the processing time of completed batches has been very unstable.


The slow batches (>45 s) spend most of their time in the first stage of the job. But the first stage only receives messages from the Kafka brokers (createDirectStream) and applies a simple map that I added for testing.

stream
  .map { _ =>
    // Test record: the incoming Kafka message is ignored and a constant is emitted.
    val raw = "0.0;" + getFormatDate(1600000000L) + ";1;1;1;1;1;1;1;1;1;1;1;1"
    // Value is (count, HLL sketch, raw record); the key is the MD5 of the raw record.
    val value: (Int, HLL, String) =
      (1, hyperLogLog(getMd5("1").getBytes(Charsets.UTF_8)), raw)
    (getMd5(raw), value)
  }


I don't know why some batches take 40+ seconds in the first stage. Is receiving from Kafka unstable? Kafka and Spark are on the same LAN.

Nelson
  • My gut feeling says: https://stackoverflow.com/questions/36042295/spark-streaming-mapwithstate-seems-to-rebuild-complete-state-periodically/36065778#36065778 – Yuval Itzchakov Dec 05 '17 at 11:47
  • My point is that the batches that take >45 s are not checkpoint batches. I set the checkpoint interval to 600 seconds. – Nelson Dec 05 '17 at 14:01
  • If you set it to 600 seconds, your RDD lineage will keep growing and the job will get slower and slower until it stops working. Checkpointing also cuts the graph lineage. – Yuval Itzchakov Dec 05 '17 at 14:18
  • The batch interval is 60 seconds and the checkpoint interval is 600 seconds. That means `checkpoint()` runs once every 10 processed batches and, as you say, checkpointing cuts the graph lineage. But I still don't know why some batches take 45 seconds, because there is no checkpoint in those batches. – Nelson Dec 06 '17 at 01:48
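
The arithmetic from the last comment can be sketched in plain Scala, with no Spark dependency. The object name `CheckpointSchedule` and the helper `isCheckpointBatch` are made up for illustration, and the sketch assumes that batches are numbered from 1 and that checkpoints fall exactly on multiples of the ratio between the two intervals. Actual alignment in a running job may differ.

```scala
// Sketch only: plain Scala, no Spark. Names here are hypothetical.
object CheckpointSchedule {
  // Values quoted in the comment thread.
  val batchIntervalSec = 60
  val checkpointIntervalSec = 600

  // Number of batches between two consecutive checkpoints.
  val batchesPerCheckpoint: Int = checkpointIntervalSec / batchIntervalSec

  // True if the n-th batch (1-based) would coincide with a checkpoint,
  // assuming checkpoints land on exact multiples of batchesPerCheckpoint.
  def isCheckpointBatch(n: Int): Boolean = n % batchesPerCheckpoint == 0

  def main(args: Array[String]): Unit = {
    val checkpointBatches = (1 to 30).filter(isCheckpointBatch)
    println(s"batches per checkpoint: $batchesPerCheckpoint")
    println(s"checkpoint batches among the first 30: ${checkpointBatches.mkString(", ")}")
  }
}
```

Under these assumptions only batches 10, 20 and 30 of the first 30 would include a checkpoint, so a slow batch at any other position would have to be explained by something else.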
