
Here's an example of aggregateByKey on mutable.HashSet[String], written by @bbejeck:

import scala.collection.mutable

val initialSet = mutable.HashSet.empty[String]                                                    // zero value per key
val addToSet = (s: mutable.HashSet[String], v: String) => s += v                                  // fold one value into a set within a partition
val mergePartitionSets = (p1: mutable.HashSet[String], p2: mutable.HashSet[String]) => p1 ++= p2  // merge sets across partitions
val uniqueByKey = kv.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
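
For reference, kv is a pair RDD created earlier, e.g. something like:

val kv = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("a", "x"), ("b", "z")))   // RDD[(String, String)]

With that, uniqueByKey.collect() returns Array((a,Set(x, y)), (b,Set(z))) (set contents; element order may vary).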

But when I changed to a Dataset, I got the following error. Is that because Spark 2.0 (the version I'm using) doesn't support aggregateByKey on Datasets?

java.lang.NullPointerException
at org.apache.spark.sql.Dataset.schema(Dataset.scala:393)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:339)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
at org.apache.spark.sql.Dataset.show(Dataset.scala:495)

Here's the code:

case class Food(name: String,
                price: String,
                e_date: String)

rdd.aggregateByKey(Seq(Food("", "", "")).toDS)(
  (f1, f2) => f1.union(f2),
  (f1, f2) => f1.union(f2))
/////////
found f1 = Invalid tree; null:
                    null

Any ideas why this is happening? Thank you in advance!


1 Answer


Yes, I think aggregateByKey works with RDDs only; Dataset has no such method.
Here is the documentation (it's for Python):
http://spark.apache.org/docs/latest/api/python/pyspark.html

Also, a Dataset can't be used as the aggregation value inside an RDD operation: it gets serialized out to the executors, where the SparkSession behind it is null, which is most likely where your NullPointerException comes from.

Remove .toDS and run the aggregation on the plain RDD. You can convert the result into a Dataset after the aggregation is done (not sure if it would be any better in performance).
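
Something like this minimal sketch (assuming rdd is an RDD[(String, Food)] and spark is your SparkSession):

// aggregate on the RDD, collecting the Foods for each key into a Seq
val aggregated = rdd.aggregateByKey(Seq.empty[Food])(
  (acc, f) => acc :+ f,   // add one Food to the per-key Seq within a partition
  (a1, a2) => a1 ++ a2    // merge partial Seqs across partitions
)

// only now turn the result into a Dataset, on the driver
import spark.implicits._
val foodsByKeyDS = aggregated.toDS()

This keeps all Dataset creation on the driver, so nothing Dataset-related gets serialized to the executors.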
