So I am new to Scala and just starting to work with RDDs and functional Scala operations.

I am trying to iterate over the values of my pair RDD and, for each Var1, return the average of the associated Var2 values by applying the average function defined below, so that the final result contains each unique Var1 paired with a single AvgVar2. I am having a lot of trouble figuring out how to iterate over the values.

*Edit: I have the following type declarations:

case class ID: Int,  Var1: Int, Var2: Int extends Serializable

I have the following function:

  def foo(rdds: RDD[(ID, Iterable[(Var1, Var2)])]): RDD[(Var1, AvgVar2)] = {

    def average(as: Array[Var2]): AvgVar2 = {
       var sum = 0.0
       var i = 0.0
       while (i < as.length) {
           sum += Var2.val
           i += 1
      }
      sum/i
    }

    //My attempt at Scala
    rdds.map(x=> ((x._1),x._2)).groupByKey().map(x=>average(x._1)).collect()
}

My attempt at Scala is trying to do the following:

  1. split the RDD pair Iterable into key-value pairs of Var1-Var2.
  2. Group by the key of Var1 and create an array of associated Var2.
  3. Apply my average function to each array of Var2
  4. Return each Var1 with its AvgVar2 as a single RDD of (Var1, AvgVar2) pairs
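The four steps above can be sketched roughly as follows. This is only a sketch under assumptions: it treats ID, Var1 and Var2 as plain aliases for Int and AvgVar2 as an alias for Double (the question does not pin these down), and the object name AverageByVar1 and the local SparkContext in main are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AverageByVar1 {
  // Assumption: all three are Int aliases and AvgVar2 is a Double alias.
  type ID = Int
  type Var1 = Int
  type Var2 = Int
  type AvgVar2 = Double

  // Same signature as in the question; an empty array yields NaN.
  def average(as: Array[Var2]): AvgVar2 = as.sum.toDouble / as.length

  def foo(rdds: RDD[(ID, Iterable[(Var1, Var2)])]): RDD[(Var1, AvgVar2)] =
    rdds
      .flatMap { case (_, pairs) => pairs }  // step 1: drop the ID, emit (Var1, Var2) pairs
      .groupByKey()                          // step 2: collect all Var2 values per Var1
      .mapValues(vs => average(vs.toArray))  // steps 3 and 4: (Var1, AvgVar2)

  def main(args: Array[String]): Unit = {
    // Hypothetical local context, only for trying the sketch out.
    val sc = new SparkContext(new SparkConf().setAppName("avg-by-var1").setMaster("local[*]"))
    val input = sc.parallelize(Seq(
      (1, Iterable((1, 3), (1, 12), (1, 6))),
      (2, Iterable((2, 5), (2, 7)))
    ))
    foo(input).collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}
```

Note that groupByKey triggers a shuffle. If every Var1 really occurs under a single ID only, a plain map over each record gives the same result without one.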

*Edit:

Some sample input data for rdds:

//RDD[(ID,Iterable[(Var1,Var2)...])]
RDD[(1,[(1,3),(1,12),(1,6)])],
RDD[(2,[(2,5),(2,7)])]

Some sample output data:

//RDD[(Var1, AvgVar2)]
RDD[(1,7),(2,6)]

*Edit: Line of working Scala code:

rdd.map(x => (x._2.map(it => it._1).asInstanceOf[Var1], average(x._2.map(it => it._2).toArray)))
EliSquared
  • Is there any specific reason you decided to use RDDs instead of DataFrames? – Pavel Feb 01 '19 at 08:36
  • Can you provide a sample input and output? Can the same `Var1` value occur under different `ID`s? – vdep Feb 01 '19 at 09:48
  • @vdep, each unique `Var1` value can only be associated with a single ID, but I want to drop the ID in the return RDD. I have added in sample input and output data in my edit that I hope clarifies the question. – EliSquared Feb 01 '19 at 18:28

1 Answer

Assuming ID == Var1, a simple .map() is enough:

def foo(rdds: RDD[(Int, Iterable[(Int, Int)])]): RDD[(Int, Double)] = {

  def average(as: Iterable[(Int, Int)]): Double = {
    as.map(_._2).reduce(_+_)/as.size.toDouble
  }

  rdds.map(x => (x._1, average(x._2)))
}

Output:

val input = sc.parallelize(List((1,Iterable((1,3),(1,12),(1,6))), (2, Iterable((2,5),(2,7)))))

scala> foo(input).collect
res0: Array[(Int, Double)] = Array((1,7.0), (2,6.0))

Edit (average() keeping the Array signature from the question):

def foo(rdds: RDD[(Int, Iterable[(Int, Int)])]): RDD[(Int, Double)] = {

  def average(as: Array[Int]): Double = {
    as.reduce(_+_)/as.size.toDouble
  }

  rdds.map(x => (x._1, average(x._2.map(tuple => tuple._2).toArray)))
}
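If the key of the result has to come from the pairs themselves rather than from the ID (the point raised in the comments), one option is to take it from the first pair of each Iterable. As far as I can tell, `x._2.map(it => it._1)` is still an Iterable of Var1 values, so casting it with `.asInstanceOf[Var1]` compiles, but I would expect it to fail with a ClassCastException once an action such as collect actually runs. The sketch below assumes ID, Var1 and Var2 are Int aliases, AvgVar2 is a Double alias, and all pairs inside one Iterable share the same Var1; the names KeyFromPairs and var1WithAverage are hypothetical.

```scala
import org.apache.spark.rdd.RDD

object KeyFromPairs {
  // Assumption: Int aliases for the keys and values, Double for the average.
  type ID = Int
  type Var1 = Int
  type Var2 = Int
  type AvgVar2 = Double

  // Same signature as the average function in the question.
  def average(as: Array[Var2]): AvgVar2 = as.sum.toDouble / as.length

  // Hypothetical helper: the key comes from the first pair, the average from all Var2 values.
  // Requires a non-empty Iterable, because head throws on an empty one.
  def var1WithAverage(pairs: Iterable[(Var1, Var2)]): (Var1, AvgVar2) =
    (pairs.head._1, average(pairs.map(_._2).toArray))

  def foo(rdds: RDD[(ID, Iterable[(Var1, Var2)])]): RDD[(Var1, AvgVar2)] =
    rdds
      .filter { case (_, pairs) => pairs.nonEmpty }       // skip records without pairs
      .map { case (_, pairs) => var1WithAverage(pairs) }  // drop the ID, keep (Var1, AvgVar2)
}
```

Pattern matching with `case (_, pairs)` also reads a little easier than chains of `x._2` and `it._1`, which may help with the readability part of the question.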
vdep
  • I appreciate the response, but would it be possible to do it without changing the form of the `average` function? I want to understand how to create an array from the various `Var2` values so I can apply any function with the signature `average(as: Array[Var2])` to an RDD in the given form. Additionally, while `ID == Var1`, they are different object types, so that would cause the function to throw an error. I'll admit this is for a homework assignment, so I need the types to match exactly. – EliSquared Feb 01 '19 at 19:06
  • @EliSquared edited the answer. I am not sure what you mean by: "ID == Var1, they are different object types so that would cause the function to throw an error" – vdep Feb 01 '19 at 19:13
  • I mean that `ID` and `Var1` are each defined classes (see edit above the function). Now the only error I am having is on the `x._1` which returns an `ID`, when I need it to return the first element of the Iterable (`Var1`). I am now trying something like this: `rdds.map(x => (x._2.map(it=> it._1), average(x._2.map(tuple => tuple._2).toArray)))`, to try and select the first element of the iterable, but am still getting a type error. – EliSquared Feb 01 '19 at 23:42
  • I actually got it to work, I just had to add `.asInstanceOf[Var1]` after `(x._2.map(it=> it._1)`, giving `(x._2.map(it=> it._1).asInstanceOf[Var1]`. See the fully working line of code above. That being said, if you can optimize for efficiency or readability, I would be interested. – EliSquared Feb 02 '19 at 00:37