
I am fairly new to Spark Streaming.

I have streaming data containing two values, x and y. For example:

1 300

2 8754

3 287

etc.

Out of the streamed data, I want to get the smallest y value, largest y value, and the mean of the x values. This needs to be output as follows (using the example above):

287 8754 4

I have been able to calculate these values with individual transform/reduce operations, but I have not managed to do it in a single transformation.

Here is my current code:

val transformedStream = windowStream.map(line => {
  Array(line.split(" ")(0).toLong, line.split(" ")(1).toLong)
})

val smallest: DStream[Double] = transformedStream.reduce((a, b) => {
  Array(0, math.min(a(1), b(1)))
}).map(u => u(1).toDouble)

val biggest = transformedStream.reduce((a, b) => {
  Array(0, math.max(a(1), b(1)))
}).map(u => u(1).toDouble)

val mean = transformedStream.reduce((a, b) => Array((a(0) + b(0)) / 2)).
  map(u => u(0).toDouble)

1 Answer


Try this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, mean, min}

val spark: SparkSession = ???
import spark.implicits._

windowStream.transform(rdd => {
  // Parse each "x y" line into a (Long, Long) pair; malformed lines are dropped
  val xy = rdd.map(_.split(" ")).collect {
    case Array(x, y) => (x.toLong, y.toLong)
  }
  // Aggregate min(y), max(y) and mean(x) in one pass, then go back to an RDD
  xy.toDF("x", "y").agg(min("y"), max("y"), mean("x"))
    .as[(Long, Long, Double)].rdd
})
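For reference, a minimal way to consume the result, assuming the transform above is assigned to a val result and that your StreamingContext is named ssc (both names are placeholders for illustration):

// assuming: val result = windowStream.transform(...) as shown above,
// and an already-configured StreamingContext named ssc
result.print() // prints one (min(y), max(y), mean(x)) tuple per window

ssc.start()
ssc.awaitTermination()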

Important:

transformedStream.reduce((a, b) => Array((a(0) + b(0)) / 2))

doesn't compute the mean of x. It repeatedly averages pairs of intermediate results, so elements end up weighted unevenly and the outcome depends on the order in which the reduce is applied; a correct mean needs a running sum and a count.
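If you prefer to stay with plain DStream operations and still do everything in one pass, here is a minimal sketch (assuming windowStream is a DStream[String] of space-separated "x y" lines, as in the question; stats is just an illustrative name) that carries (min, max, sum, count) through a single reduce and derives the mean at the end:

import org.apache.spark.streaming.dstream.DStream

// Carry (minY, maxY, sumX, countX) through a single reduce per window
val stats: DStream[(Long, Long, Double)] = windowStream
  .map(_.split(" "))
  .filter(_.length == 2)
  .map(a => (a(1).toLong, a(1).toLong, a(0).toLong, 1L)) // (minY, maxY, sumX, count)
  .reduce { case ((min1, max1, sum1, n1), (min2, max2, sum2, n2)) =>
    (math.min(min1, min2), math.max(max1, max2), sum1 + sum2, n1 + n2)
  }
  .map { case (minY, maxY, sumX, n) => (minY, maxY, sumX.toDouble / n) }

Each window then produces a single (min(y), max(y), mean(x)) tuple, in the same order as the expected "287 8754 ..." output.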