
I want to use Spark Streaming to read data from HDFS. The idea is that another program will keep uploading new files to an HDFS directory, which my Spark Streaming job processes. However, I also want an end condition: a way for the program uploading files to HDFS to signal the Spark Streaming program that it is done uploading all the files.

For a simple example, take the program from here. The code is shown below. Assuming another program is uploading those files, how can that program programmatically signal the end condition to the Spark Streaming program (rather than requiring us to press CTRL+C)?

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: StreamingWordCount <input-directory> <output-directory>")
      System.exit(1)
    }
    val inputDir = args(0)
    val output = args(1)
    val conf = new SparkConf().setAppName("Spark Streaming Example")
    // Process newly arrived files in 10-second micro-batches
    val streamingContext = new StreamingContext(conf, Seconds(10))
    val lines = streamingContext.textFileStream(inputDir)
    val words = lines.flatMap(_.split(" "))
    val wc = words.map(x => (x, 1))
    // Write each batch to its own directory; saveAsTextFile fails if the path already exists
    wc.foreachRDD((rdd, time) => {
      val counts = rdd.reduceByKey((x, y) => x + y)
      counts.saveAsTextFile(output + "-" + time.milliseconds)
      val collectedCounts = counts.collect
      collectedCounts.foreach(c => println(c))
    })

    println("StreamingWordCount: streamingContext start")
    streamingContext.start()
    println("StreamingWordCount: await termination")
    streamingContext.awaitTermination()
    println("StreamingWordCount: done!")
  }
}
pythonic
  • Could you add some control bytes (something like 0x1c 0x0d) to the end of the data your job uploads, then watch for those bytes in your Spark Streaming program and terminate when they are matched? Also, why use Spark Streaming for this use case rather than kicking off another job after you upload the files? – pjames Oct 10 '17 at 02:14
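
For reference, a minimal sketch of the sentinel idea from this comment, assuming the uploader writes a hypothetical marker line END_OF_STREAM as the last record of the last file. The batch that sees the marker only sets a flag; a watcher thread performs the actual stop, since stopping the context from inside foreachRDD can deadlock (stop() waits for the running batch to finish):

import java.util.concurrent.atomic.AtomicBoolean

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SentinelWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Sentinel Example")
    val ssc = new StreamingContext(conf, Seconds(10))
    val done = new AtomicBoolean(false)

    val lines = ssc.textFileStream(args(0))
    lines.foreachRDD { rdd =>
      // Raise the flag when the sentinel line appears in a batch
      if (!rdd.filter(_ == "END_OF_STREAM").isEmpty()) done.set(true)
      // ... normal per-batch processing goes here ...
    }

    // Watcher thread: stop gracefully once the flag is set, so batches
    // already received are fully processed before shutdown
    new Thread {
      override def run(): Unit = {
        while (!done.get()) Thread.sleep(1000)
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }.start()

    ssc.start()
    ssc.awaitTermination()
  }
}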

1 Answer


OK, I got it. Basically, you create another thread from which you call ssc.stop(), to signal the stream processing to stop. For example, like this:

val ssc = new StreamingContext(sparkConf, Seconds(1))
//////////////////////////////////////////////////////////////////////
// A separate thread stops the StreamingContext; awaitTermination()
// in the main thread then returns.
val thread = new Thread
{
    override def run(): Unit =
    {
        // ...
        // On reaching the end condition
        ssc.stop()
        // (pass stopGracefully = true, i.e. ssc.stop(true, true), to let
        // already-received batches finish processing before shutdown)
    }
}
thread.start()
//////////////////////////////////////////////////////////////////////
val lines = ssc.textFileStream("inputDir")
// ...
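
To make the elided end-condition concrete, here is one hedged sketch of what that thread's body could look like, continuing the snippet above. It assumes the uploader drops an empty marker file (hypothetically named _DONE) into the input directory after its last data file, and polls HDFS for it via the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical marker file the uploader writes after its last data file
val doneMarker = new Path("inputDir/_DONE")

val thread = new Thread
{
    override def run(): Unit =
    {
        val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
        // Poll HDFS until the uploader signals completion
        while (!fs.exists(doneMarker))
            Thread.sleep(5000)
        // Stop gracefully so already-received batches finish processing
        ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
}
thread.start()

Since the marker file is empty, it contributes nothing to the word counts even if textFileStream picks it up as a new file, and the graceful stop ensures data uploaded just before the marker still gets processed.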
pythonic