Weighted average with Spark Datasets without UDF

Question

Someone has already asked about computing a weighted average in Spark, but that question was about RDDs. This one is about doing it with Datasets/DataFrames.

How do I compute a weighted average in Spark? I have two columns: counts and previous averages:

case class Stat(name: String, count: Int, average: Double)
val statset = spark.createDataset(Seq(Stat("NY", 1, 5.0),
                                      Stat("NY", 2, 1.5),
                                      Stat("LA", 12, 1.0),
                                      Stat("LA", 15, 3.0)))

I would like to be able to compute a weighted average like this:

display(statset.groupBy($"name").agg(sum($"count").as("count"),
                    weightedAverage($"count",$"average").as("average")))
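
For reference, the quantity being requested for each group is the count-weighted mean of the previous averages. The symbols c_i (for count) and a_i (for average) are introduced here only for illustration; the numbers are worked from the four sample rows above.

```latex
\[
\begin{aligned}
\bar{a}_{\text{group}} &= \frac{\sum_i c_i \, a_i}{\sum_i c_i} \\[4pt]
\bar{a}_{\text{NY}} &= \frac{1 \cdot 5.0 + 2 \cdot 1.5}{1 + 2} = \frac{8}{3} \approx 2.667 \\[4pt]
\bar{a}_{\text{LA}} &= \frac{12 \cdot 1.0 + 15 \cdot 3.0}{12 + 15} = \frac{57}{27} \approx 2.111
\end{aligned}
\]
```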

One can use a UDF to get close:

val weightedAverage = udf(
  (row: Row) => {
    // Field 0 holds the collected counts, field 1 the collected averages.
    val counts = row.getAs[WrappedArray[Int]](0)
    val averages = row.getAs[WrappedArray[Double]](1)
    // Accumulate the total count and the count-weighted sum of the averages.
    val (count, total) = (counts zip averages).foldLeft((0, 0.0)) {
      case ((cumcount: Int, cumtotal: Double), (newcount: Int, newaverage: Double)) =>
        (cumcount + newcount, cumtotal + newcount * newaverage)
    }
    total / count  // Tested by returning count here and then extracting. Got same result as sum.
  }
)

display(statset.groupBy($"name").agg(sum($"count").as("count"),
                    weightedAverage(struct(collect_list($"count"),
                                    collect_list($"average"))).as("average")))

(Thanks to the answers to "Passing a list of tuples as a parameter to a spark udf in scala" for help in writing this.)

Newbies: Use these imports:

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.collection.mutable.WrappedArray

Is there a way of accomplishing this with built-in column functions instead of UDFs? The UDF feels clunky, and if the numbers get large you have to convert the Ints to Longs.
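
To illustrate the overflow concern with a fold-based accumulator like the one in the UDF: an Int running total wraps silently once it passes Int.MaxValue, while accumulating in a Long does not. This is a minimal plain-Scala sketch with no Spark involved; the object and value names are illustrative only.

```scala
object IntOverflowSketch {
  // Two counts whose true sum (2147483648) does not fit in an Int.
  val counts: Seq[Int] = Seq(Int.MaxValue, 1)

  // Accumulating in an Int wraps around to Int.MinValue.
  val intTotal: Int = counts.foldLeft(0)(_ + _)

  // Starting from a Long zero widens each addition, keeping the true total.
  val longTotal: Long = counts.foldLeft(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    println(s"Int accumulator:  $intTotal")
    println(s"Long accumulator: $longTotal")
  }
}
```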

score 5 · Accepted Answer · edited Aug 11 '17 at 18:00

Looks like you could do it in two passes:

val totalCount = statset.select(sum($"count")).collect.head.getLong(0)

statset.select(lit(totalCount) as "count", sum($"average" * $"count" / lit(totalCount)) as "average").show

Or, including the groupBy you just added:

display(statset.groupBy($"name").agg(sum($"count").as("count"),
                    sum($"count"*$"average").as("total"))
               .select($"name",$"count",($"total"/$"count").as("average")))
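
The same expression can also be wrapped in a small helper so that the call site has the shape asked for in the question. This is a sketch rather than part of the original answer: weightedAverage here is a hypothetical helper built from the built-in sum, not a Spark function, and the object name, app name, and local master setting are assumptions for a standalone run. It uses show() instead of display, which is only available in notebook environments.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.sum

// Mirrors the case class from the question; kept at top level so an encoder can be derived.
case class Stat(name: String, count: Int, average: Double)

object WeightedAverageSketch {
  // Hypothetical helper (not a Spark built-in): sum(weight * value) / sum(weight).
  def weightedAverage(weight: Column, value: Column): Column =
    sum(weight * value) / sum(weight)

  def main(args: Array[String]): Unit = {
    // Assumed local session for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("weighted-average-sketch")
      .getOrCreate()
    import spark.implicits._

    val statset = spark.createDataset(Seq(
      Stat("NY", 1, 5.0),
      Stat("NY", 2, 1.5),
      Stat("LA", 12, 1.0),
      Stat("LA", 15, 3.0)))

    // Single aggregation pass: total count and weighted average per name.
    statset.groupBy($"name")
      .agg(sum($"count").as("count"),
           weightedAverage($"count", $"average").as("average"))
      .show()

    spark.stop()
  }
}
```

Since the built-in sum over an integer column produces a long (which is why the two-pass version reads the total with getLong), the count total should not need a manual Int-to-Long conversion with this approach.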

answered Aug 10 '17 at 20:12 by Michel Lemay · edited Aug 11 '17 at 18:00 by Josiah Yoder

I would add the total count as another column in the second aggregation and then do the division at the end. The second pass would need to go over much less data. – Assaf Mendelson Aug 11 '17 at 06:14
@MichelLemay: Thanks! That's just what I needed to jog my thinking. I've suggested an edit to your answer that also works with groupBy. – Josiah Yoder Aug 11 '17 at 14:52
you can accept the answer @JosiahYoder if it helped you – Ramesh Maharjan Aug 12 '17 at 08:08
