
As I'm new to Spark's Scala API, I have the following problem:

In my Java code I used a reduceByKeyAndWindow transformation, but now I see that in Scala there is only a reduceByWindow (and also no PairDStream). However, I got the first steps working in Scala:

import org.apache.hadoop.conf.Configuration
import [...]

val serverIp = "xxx.xxx.xxx.xxx"
val receiverInstances = 2
val batchIntervalSec = 2
val windowSize1hSek = 60 * 60
val slideDurationSek = batchIntervalSec

val streamingCtx = new StreamingContext(sc, Seconds(batchIntervalSec))

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "xxx")

// ReceiverInputDStream
val receiver1 = streamingCtx.socketTextStream(serverIp, 7777)
val receiver2 = streamingCtx.socketTextStream(serverIp, 7778)

// DStream
val inputDStream = receiver1.union(receiver2)

// h.hh.plug.ts.val
case class DebsEntry(house: Integer, household: Integer, plug: Integer, ts: Long, value: Float)

// h.hh.plug.val
case class DebsEntryWithoutTs(house: Integer, household: Integer, plug: Integer, value: Float)

// h.hh.plug.1
case class DebsEntryWithoutTsCount(house: Integer, household: Integer, plug: Integer, count: Long)

val debsPairDStream = inputDStream.map(s => s.split(",")).map(s => DebsEntry(s(6).toInt, s(5).toInt, s(4).toInt, s(1).toLong, s(2).toFloat)) //.foreachRDD(rdd => rdd.toDF().registerTempTable("test"))

val debsPairDStreamWithoutDuplicates = debsPairDStream.transform(s => s.distinct())

val debsPairDStreamConsumptionGreater0 = debsPairDStreamWithoutDuplicates.filter(s => s.value > 100.0)

debsPairDStreamConsumptionGreater0.foreachRDD(rdd => rdd.toDF().registerTempTable("test3"))

val debsPairDStreamConsumptionGreater0withoutTs = debsPairDStreamConsumptionGreater0.map(s => DebsEntryWithoutTs(s.house, s.household, s.plug, s.value))

// 5.) Average per Plug
// 5.1) Create a count-prepared PairDStream (house, household, plug, 1)
val countPreparedPerPlug1h = debsPairDStreamConsumptionGreater0withoutTs.map(s => DebsEntryWithoutTsCount(s.house, s.household, s.plug, 1))

// 5.2) ReduceByKeyAndWindow
val countPerPlug1h = countPreparedPerPlug1h.reduceByWindow(...???...)

Until step 5.1 everything works fine. In 5.2 I now want to sum up the 1's of the countPreparedPerPlug1h but only if the other attributes (house, household, plug) are equal. - The goal is to get a entry count per (house, household, plug) combination. Can someone help? Thank you!
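To illustrate the result I'm after, here is the same counting on a static list in plain Scala (no Spark, made-up values; the case class mirrors my count-prepared records):

```scala
// Made-up records as produced by step 5.1 (every count starts at 1).
case class PlugCount(house: Int, household: Int, plug: Int, count: Long)

val prepared = List(
  PlugCount(1, 1, 1, 1), PlugCount(1, 1, 1, 1), PlugCount(2, 1, 3, 1)
)

// Desired result: number of entries per (house, household, plug) combination.
val perPlug = prepared
  .groupBy(e => (e.house, e.household, e.plug))
  .map { case (key, entries) => key -> entries.map(_.count).sum }
// perPlug: Map((1,1,1) -> 2, (2,1,3) -> 1)
```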

EDIT - FIRST TRY

I tried in step 5.2 the following:

// 5.2)
val countPerPlug1h = countPreparedPerPlug1h.reduceByKeyAndWindow((a,b) => a+b, Seconds(windowSize1hSek), Seconds(slideDurationSek))

But here I get the following error:

<console>:69: error: missing parameter type
   val countPerPlug1h = countPreparedPerPlug1h.reduceByKeyAndWindow((a,b) => a+b, Seconds(windowSize1hSek), Seconds(slideDurationSek))
                                                                     ^

It seems that I am using the reduceByKeyAndWindow transformation wrong, but where is the mistake? The type of the values to sum up is Int, see countPreparedPerPlug1h in step 5.1 above.

D. Müller

2 Answers


In Scala, reduceByKeyAndWindow is even simpler to use than in your Java version. There is no PairDStream: pair-specific methods are added implicitly, so you can call them directly on any DStream of pairs. The implicit resolution goes through PairDStreamFunctions.

For example:

val myPairDStream: DStream[(KeyType, ValueType)] = ...
myPairDStream.reduceByKeyAndWindow(...)

which is really the following behind the scenes:

new PairDStreamFunctions(myPairDStream).reduceByKeyAndWindow(...)

This wrapper of PairDStreamFunctions is added to any DStream whose elements are a Tuple2.
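Applied to your case, that means mapping the case class into a Tuple2 first, e.g. `countPreparedPerPlug1h.map(e => ((e.house, e.household, e.plug), 1L))`, which gives a `DStream[((Int, Int, Int), Long)]` on which reduceByKeyAndWindow becomes available. A sketch of the per-key reduce with plain Scala collections in place of a DStream (made-up values):

```scala
// Pairs as produced by keying on (house, household, plug), value = 1.
val pairs = List(((1, 1, 1), 1L), ((1, 1, 1), 1L), ((1, 2, 3), 1L))

// Mimic reduceByKey with the same (a, b) => a + b function Spark would apply
// to all values of one key inside the window.
val reduced = pairs.foldLeft(Map.empty[(Int, Int, Int), Long]) {
  case (acc, (key, v)) => acc.updated(key, acc.getOrElse(key, 0L) + v)
}
// reduced: Map((1,1,1) -> 2, (1,2,3) -> 1)
```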

Justin Pihony
  • Thank you for the answer! But how can I use the reduceByKeyAndWindow transformation? My attempt didn't work: val countPerPlug1h = countPreparedPerPlug1h.reduceByKeyAndWindow((a,b) => a+b, Seconds(windowSize1hSek), Seconds(slideDurationSek)) – D. Müller Dec 17 '15 at 20:16
  • What is the type of `countPreparedPerPlug1h` and/or what is the error you get? – Justin Pihony Dec 17 '15 at 21:28
  • I added my code and the error message to the question. – D. Müller Dec 18 '15 at 12:49

I got it; it seems to work now with the following code:

val countPerPlug1h = countPreparedPerPlug1h.reduceByKeyAndWindow({(x, y) => x + y}, {(x, y) => x - y}, Seconds(windowSize1hSek), Seconds(slideDurationSek))

Thanks for your clues, @Justin Pihony
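For completeness: the second function here, (x, y) => x - y, is the inverse reduce. It lets Spark update each window incrementally (add the batch sliding in, subtract the batch sliding out) instead of re-reducing the whole hour; note that this variant requires checkpointing to be enabled, e.g. streamingCtx.checkpoint("..."). A plain-Scala sketch of the idea with made-up per-batch counts for one key:

```scala
// Per-batch counts for one (house, household, plug) key (made-up values).
val batchCounts = Seq(3L, 1L, 4L, 2L, 5L)
val windowLen = 3 // window covers 3 batch intervals, slide = 1 interval

// Naive windowing: re-reduce every batch in the window on each slide.
val naive = batchCounts.sliding(windowLen).map(_.sum).toList

// Incremental windowing, as the invertible variant does internally:
// new window = old window + entering batch - leaving batch.
val incremental = (windowLen until batchCounts.length)
  .scanLeft(batchCounts.take(windowLen).sum) { (acc, i) =>
    acc + batchCounts(i) - batchCounts(i - windowLen)
  }
  .toList
// incremental == naive
```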

D. Müller