28

I'm trying to learn to use DataFrames and Datasets more, in addition to RDDs. For an RDD, I know I can do someRDD.reduceByKey((x, y) => x + y), but I don't see that function for Dataset. So I decided to write one.

import scala.collection.mutable

someRdd.map(x => ((x.fromId, x.toId), 1))   // key each element by (fromId, toId)
  .map(x => mutable.Map(x))                 // wrap each keyed pair in a single-entry map
  .reduce((x, y) => {                       // merge maps pairwise, summing the counts per key
    val result = mutable.HashMap.empty[(Long, Long), Int]
    val keys = mutable.HashSet.empty[(Long, Long)]
    y.keys.foreach(z => keys += z)
    x.keys.foreach(z => keys += z)
    for (elem <- keys) {
      val s1 = if (x.contains(elem)) x(elem) else 0
      val s2 = if (y.contains(elem)) y(elem) else 0
      result(elem) = s1 + s2
    }
    result
  })

However, this returns everything to the driver. How would you write this to return a Dataset? Maybe mapPartitions and do it there?

Note: this compiles but does not run, because there are no encoders for Map yet.
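For reference, the RDD idiom I am trying to reproduce looks like this (a sketch only; edgesRdd is a hypothetical RDD whose elements have the same fromId/toId fields as above):

// Sketch only; edgesRdd is a hypothetical RDD with fromId/toId fields.
// reduceByKey combines values per key within each partition before shuffling,
// so only one partial sum per key crosses the network.
val pairCounts = edgesRdd
  .map(x => ((x.fromId, x.toId), 1))
  .reduceByKey(_ + _)            // => RDD[((Long, Long), Int)]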

Carlos Bribiescas
  • With Spark 2.0.0, try yourDataset.groupByKey(...).reduceGroups(...) – FelixHo Jul 31 '16 at 16:38
  • Will the Catalyst optimizer notice that you're doing a group followed by a reduce and make it more efficient? By 'efficient' I mean in the same sense that, on an RDD, a reduceByKey is better than a groupBy followed by a reduce. – Carlos Bribiescas Aug 01 '16 at 13:40

2 Answers

40

I assume your goal is to translate this idiom to Datasets:

rdd.map(x => (x.someKey, x.someField))
   .reduceByKey(_ + _)

// => returning an RDD of (KeyType, FieldType)

Currently, the closest solution I have found with the Dataset API looks like this:

ds.map(x => (x.someKey, x.someField))          // [1]
  .groupByKey(_._1)                            
  .reduceGroups((a, b) => (a._1, a._2 + b._2))
  .map(_._2)                                   // [2]

// => returning a Dataset of (KeyType, FieldType)

// Comments:
// [1] As far as I can see, having a map before groupByKey is required
//     to end up with the proper type in reduceGroups. After all, we do
//     not want to reduce over the original type, but the FieldType.
// [2] required since reduceGroups converts back to Dataset[(K, V)]
//     not knowing that our V's are already key-value pairs.

This doesn't look very elegant, and according to a quick benchmark it is also much less performant, so maybe we are missing something here...

Note: An alternative might be to use groupByKey(_.someKey) as a first step. The problem is that using groupByKey changes the type from a regular Dataset to a KeyValueGroupedDataset. The latter does not have a regular map function; instead it offers mapGroups, which does not seem very convenient because it wraps the values in an Iterator and, according to the docstring, performs a shuffle.
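For illustration, here is a minimal sketch of that alternative, assuming a hypothetical case class Record(someKey: String, someField: Int), ds: Dataset[Record], and spark.implicits._ in scope:

// Hypothetical types and names, for illustration only.
case class Record(someKey: String, someField: Int)

val summed: Dataset[(String, Int)] =
  ds.groupByKey(_.someKey)             // Dataset[Record] -> KeyValueGroupedDataset[String, Record]
    .mapGroups { (key, records) =>     // all values for one key arrive as an Iterator[Record]
      (key, records.map(_.someField).sum)
    }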

bluenote10
  • This does the trick. Just a note though: reduceByKey is more efficient because it reduces on each node before shuffling. Doing groupByKey first shuffles all the elements and then starts reducing; that is why it's much less performant. What's funny is that this is what I used to do before I knew about reduceByKey, but I had forgotten :-) – Carlos Bribiescas Aug 06 '16 at 15:23
  • @CarlosBribiescas I have read on the interwebs that Datasets take advantage of Spark's Catalyst optimizer and should be able to push down the reduce function before shuffling. This may explain why there is no `reduceByKey` in the `Dataset` API. However, in my experience this is not the case, and `groupByKey.reduceGroups` shuffles significantly more data and is significantly slower than `reduceByKey`. – Justin Raymond Nov 09 '16 at 21:15
  • It seems that the reduceGroups performance issue has been fixed as of 2.0.1 and 2.1.0 [Spark-16391](https://issues.apache.org/jira/browse/SPARK-16391). – Franzi Apr 25 '17 at 20:50
  • Ah, yeah. From the sounds of it, it looks like it works just like reduceByKey now. Do you know if there are any plans to implement reduceByKey? This technically works but is much more verbose. – Carlos Bribiescas Oct 22 '17 at 19:20
  • This solution works for me, thanks! But I have a question: any idea why `reduceByKey` doesn't support pattern matching? For clarity, I would like to be able to write `reduceByKey{ case ((k1, v1), (k2, v2)) => (k1, v1 + v2) }`, but the compiler doesn't like it, even if I add type annotation on the left-hand side. – Paul Siegel Nov 05 '17 at 22:30
9

A more efficient solution uses mapPartitions before groupByKey to reduce the amount of shuffling (note this is not the exact same signature as reduceByKey, but I think it is more flexible to pass a function than to require that the dataset consist of tuples).

import scala.reflect.ClassTag
import org.apache.spark.sql.{Dataset, Encoder}

def reduceByKey[V: ClassTag, K](ds: Dataset[V], f: V => K, g: (V, V) => V)
  (implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, V)] = {
  // Pre-reduce each partition locally, so the groupByKey below shuffles at most
  // one value per key per partition instead of every row.
  def h[V: ClassTag, K](f: V => K, g: (V, V) => V, iter: Iterator[V]): Iterator[V] = {
    iter.toArray.groupBy(f).mapValues(_.reduce(g)).map(_._2).toIterator
  }
  ds.mapPartitions(h(f, g, _))
    .groupByKey(f)(encK)
    .reduceGroups(g)
}

Depending on the shape/size of your data, this is within 1 second of the performance of reduceByKey, and about 2x as fast as groupByKey(_._1).reduceGroups. There is still room for improvement, so suggestions would be welcome.
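For example, with a hypothetical case class Edge(fromId: Long, toId: Long, count: Int) (mirroring the fields in the question) and spark.implicits._ in scope, the helper could be used like this:

// Illustrative usage only; Edge and edges are hypothetical names.
case class Edge(fromId: Long, toId: Long, count: Int)

val reduced: Dataset[((Long, Long), Edge)] =
  reduceByKey[Edge, (Long, Long)](
    edges,                                        // edges: Dataset[Edge]
    e => (e.fromId, e.toId),                      // key extractor
    (a, b) => a.copy(count = a.count + b.count)   // merge two rows sharing the same key
  )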

Justin Raymond