
I am running the basic NetworkWordCount program on a YARN cluster through spark-shell. Here is my code snippet:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel

// One batch every 60 seconds, reusing the shell's SparkContext
val ssc = new StreamingContext(sc, Seconds(60))
// Receiver-based source: read lines of text from a TCP socket
val lines = ssc.socketTextStream("172.26.32.34", 9999, StorageLevel.MEMORY_ONLY)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()  // print the first few counts of each batch to the console
ssc.start()
ssc.awaitTermination()
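
To double-check what the Jobs tab shows, I also attached a plain SparkListener from the shell before starting the context. This is just a minimal debugging sketch I added (the println and its wording are mine, not part of the original program); it logs every job the scheduler starts, so the per-batch job count can be verified outside the web UI:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Debugging sketch: log each job start to the driver console
sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")
  }
})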

The output on the console and the stats on the Streaming tab are as expected. But when I look at the Jobs tab, two jobs get triggered per 1-minute batch interval; shouldn't it be one job per interval? Screenshot below.

However, when I look at the completed batches on the Streaming tab, I see exactly one batch per minute. Screenshot below.
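
To rule out a UI artifact, I also counted batch completions with a StreamingListener (again just a sketch I added for debugging, registered before ssc.start()):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Debugging sketch: confirm one batch completes per interval, independent of the UI
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    println(s"Batch at ${batchCompleted.batchInfo.batchTime} completed, " +
            s"${batchCompleted.batchInfo.numRecords} records")
  }
})

This consistently prints one line per minute, matching the Streaming tab but not the Jobs tab.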

Am I missing something? Also, I noticed that the start job has two stages with the same name that spawn different numbers of tasks, as seen in the image below. What exactly is happening here?

[Screenshots: Jobs tab showing two jobs per batch; Streaming tab showing one completed batch per minute; stage details for the start job]

mbaxi
  • Related question (unfortunately also unanswered) https://stackoverflow.com/questions/46162143/why-does-a-single-structured-query-run-multiple-sql-queries-per-batch – Dennis Jaheruddin Jul 31 '20 at 09:38

0 Answers