I am using Spark Streaming with Kafka to build a message processing system, and I have run into a technical problem, which I describe below.

For example, I want to produce a word count every 10 minutes. In my first version of the code, I set the batch interval to 10 minutes:

 val sparkConf = new SparkConf().setAppName(args(0)).setMaster(args(1))
 val ssc = new StreamingContext(sparkConf, Minutes(10))

But I don't think this is a good solution: 10 minutes is a long time, and the amount of data that accumulates in one batch is more than my memory can hold. So I want to reduce the batch interval to 1 minute:

 val sparkConf = new SparkConf().setAppName(args(0)).setMaster(args(1))
 val ssc = new StreamingContext(sparkConf, Minutes(1))

Then the problem is: how can I combine the results of ten 1-minute batches into a single 10-minute result? I think this work can only be done in the driver rather than in the worker program. What can I do?

I am new to Spark Streaming. Can anyone give me a hand?

wuchang

2 Answers

I may have an idea. In this situation I should use a stateful operation such as updateStateByKey(). What I want is a global 10-minute result, but each batch only gives me the intermediate result for 1 minute. So until each 10-minute period ends, I have to keep state across the 1-minute batches (for example, the word counts so far) and add each new batch's counts to it.
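To make the idea concrete, here is a minimal sketch in plain Scala, with no Spark involved, of the kind of update function updateStateByKey() expects: it receives the new values for a key in the current batch plus the previous state for that key, and returns the new state. The names RunningCountSketch, updateCount and applyBatch are made up for illustration, and applyBatch is only a stand-in that imitates how the function would be applied per key on each batch.

```scala
object RunningCountSketch {
  // Same shape as an updateStateByKey update function:
  // (values seen for this key in the current batch, previous state) => new state
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(running.getOrElse(0) + newValues.sum)

  // Hypothetical stand-in for one batch: group the batch by key,
  // then apply updateCount to every key that has state or new values.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { key =>
      updateCount(grouped.getOrElse(key, Seq.empty), state.get(key)).map(key -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Three made-up 1-minute batches of (word, 1) pairs
    val batches = Seq(
      Seq("spark" -> 1, "kafka" -> 1, "spark" -> 1),
      Seq("kafka" -> 1),
      Seq("spark" -> 1)
    )
    val totals = batches.foldLeft(Map.empty[String, Int])(applyBatch)
    println(totals)
  }
}
```

In a real job, a function like updateCount would be passed to updateStateByKey on the (word, 1) DStream, and a checkpoint directory has to be set with ssc.checkpoint(...) first. Note that a running count like this keeps growing and does not reset at the 10-minute mark on its own, so some extra reset logic would still be needed.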

wuchang

Posting here as I had a similar issue and came across the Window Operations section of the Spark Streaming programming guide. In the original question, the poster wants a count for the past 10 minutes, produced every 10 minutes, although the program calculates counts every 1 minute. Assuming counts is defined and calculated as in the standard word count (i.e. at a 1-minute batch duration, with tuples (word, count)), we could follow the linked guide and define something along the lines of

// Reduce/count the last 10 minutes' worth of data, every 10 minutes
val windowedWordCounts = counts.reduceByKeyAndWindow(_+_, Minutes(10), Minutes(10))

where _+_ is a sum function.
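For completeness, here is a rough end-to-end sketch of how that could fit into a whole program. It is only an illustration under some assumptions: it reads from a socket on localhost:9999 (a placeholder, as in the guide's word count example) instead of Kafka, it hard-codes the app name and a local[2] master, and it assumes Spark 1.3 or later so that the pair operations on DStreams are available without extra imports. I spelled out the parameter types of the sum function, as the guide's own example does.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    // Placeholder settings; local[2] leaves one thread for the receiver
    val sparkConf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")
    // 1-minute batches, as in the question
    val ssc = new StreamingContext(sparkConf, Minutes(1))

    // Placeholder source; a Kafka stream would take its place in the real job
    val lines = ssc.socketTextStream("localhost", 9999)

    // Standard word count pairs: (word, 1)
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))

    // Sum the counts over the last 10 minutes, once every 10 minutes.
    // Window and slide durations must be multiples of the batch interval.
    val windowedWordCounts =
      counts.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Minutes(10))

    windowedWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the window length and the slide interval are both 10 minutes, the windows do not overlap, so each result covers exactly ten 1-minute batches.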

Petter Friberg
Jia Teoh