Weighted Average in Spark

Question

I have two RDDs, the first I'll call userVisits that looks like this:

((123, someurl,Mon Nov 04 00:00:00 PST 2013),11.0)

and the second is allVisits:

((someurl,Mon Nov 04 00:00:00 PST 2013),1122.0)

I can call userVisits.reduceByKey(_+_) to get the number of visits by each user, and I can do the same with allVisits. What I want is a weighted average for each user: the user's visits divided by the total visits for the day. To do that I need to look up a value in allVisits using part of the key tuple from userVisits. I'm guessing it could be done with a map like this:

userVisits.reduceByKey(_+_).map( item => item._2 / allVisits.get(item._1))

I know allVisits.get(key) doesn't exist, but how could I accomplish something like that?

The alternative is to re-key userVisits so that its keys match those of allVisits and then join the two, but that seems inefficient.

score 2 · Accepted Answer · answered Dec 17 '15 at 01:58

The only universal option I see here is join:

val userVisitsAgg = userVisits.reduceByKey(_ + _)
val allVisitsAgg = allVisits.reduceByKey(_ + _)

userVisitsAgg.map{case ((id, url, date), sum) => ((url, date), (id, sum))}
  .join(allVisitsAgg)
  .map{case ((url, date), ((id, userSum), (urlSum))) => 
    ((id, url, date), userSum / urlSum)}

If allVisitsAgg is small enough to be broadcast, you can simplify the above to something like this:

val allVisitsAggBD = sc.broadcast(allVisitsAgg.collectAsMap)
userVisitsAgg.map{case ((id, url, date), sum) =>
  ((id, url, date), sum / allVisitsAggBD.value((url, date)))
}
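
One difference between the two versions is worth keeping in mind: join is an inner join, so it drops user rows whose (url, date) has no total, whereas allVisitsAggBD.value((url, date)) throws a NoSuchElementException for a missing key. A minimal sketch of a lookup that mirrors the join's behaviour follows. The names VisitWeights and weight are hypothetical, and the key types (Int id, String url and date) are assumptions, since the question does not state them.

```scala
object VisitWeights {
  type UserKey = (Int, String, String) // (id, url, date); types are assumed
  type UrlKey  = (String, String)      // (url, date)

  // Returns None when no total exists for the record's (url, date),
  // so that flatMap drops the row instead of failing the task.
  def weight(totals: scala.collection.Map[UrlKey, Double])(
      rec: (UserKey, Double)): Option[(UserKey, Double)] = {
    val ((id, url, date), userSum) = rec
    totals.get((url, date)).map(urlSum => ((id, url, date), userSum / urlSum))
  }
}
```

With the broadcast variable from above, this could be applied as userVisitsAgg.flatMap(rec => VisitWeights.weight(allVisitsAggBD.value)(rec)). Reading .value inside the closure keeps the lookup on the broadcast copy rather than serializing the map with each task.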

You're welcome. If you're willing to switch to `DataFrames` you can simplify this a little but it shouldn't be a huge difference. — zero323, Dec 17 '15 at 04:02
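
The DataFrame version mentioned in the comment could look roughly like the sketch below. It assumes Spark 2.x or later (for SparkSession); the column names (id, url, date, visits), the object name WeightedVisitsDF, and the sample rows are made up for illustration and do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object WeightedVisitsDF {
  // Expects userVisits(id, url, date, visits) and allVisits(url, date, visits).
  // These column names are assumptions, not taken from the question.
  def weighted(userVisits: DataFrame, allVisits: DataFrame): DataFrame = {
    val userAgg = userVisits.groupBy("id", "url", "date").agg(sum("visits").as("userSum"))
    val allAgg  = allVisits.groupBy("url", "date").agg(sum("visits").as("urlSum"))
    // Inner join on the shared (url, date) columns, then divide.
    userAgg.join(allAgg, Seq("url", "date"))
      .select(col("id"), col("url"), col("date"),
        (col("userSum") / col("urlSum")).as("weight"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("weighted-visits")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data
    val userVisits = Seq(
      (123, "someurl", "2013-11-04", 10.0),
      (123, "someurl", "2013-11-04", 5.0)
    ).toDF("id", "url", "date", "visits")
    val allVisits = Seq(("someurl", "2013-11-04", 60.0)).toDF("url", "date", "visits")

    weighted(userVisits, allVisits).show()
    spark.stop()
  }
}
```

The logic is the same aggregate-then-join as the RDD version. The main difference is that the join key is named by column rather than built by re-keying tuples.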
