
Here's an example of aggregateByKey on mutable.HashSet[String], written by @bbejeck:

import scala.collection.mutable

val initialSet = mutable.HashSet.empty[String]                                                    // zero value per key
val addToSet = (s: mutable.HashSet[String], v: String) => s += v                                  // fold one value into a set within a partition
val mergePartitionSets = (p1: mutable.HashSet[String], p2: mutable.HashSet[String]) => p1 ++= p2  // merge sets across partitions
val uniqueByKey = kv.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
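
For reference, kv is a pair RDD created earlier, e.g. something like:

val kv = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("a", "x"), ("b", "z")))   // RDD[(String, String)]

With that, uniqueByKey.collect() returns Array((a,Set(x, y)), (b,Set(z))) (set contents; element order may vary).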

But when I changed to a Dataset, I got the following error. Is that because Spark 2.0 (the version I'm using) doesn't support aggregateByKey on Datasets?

java.lang.NullPointerException
at org.apache.spark.sql.Dataset.schema(Dataset.scala:393)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:339)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
at org.apache.spark.sql.Dataset.show(Dataset.scala:495)

Here's the code:

case class Food(name: String,
                price: String,
                e_date: String)

rdd.aggregateByKey(Seq(Food("", "", "")).toDS)(
  (f1, f2) => f1.union(f2),
  (f1, f2) => f1.union(f2))
/////////
found f1 = Invalid tree; null:
                    null

Any ideas why this is happening? Thank you in advance!


1 Answer


Yes, I think aggregateByKey works with RDDs only; Dataset has no such method.
Here is the documentation (it's for Python):
http://spark.apache.org/docs/latest/api/python/pyspark.html

Also, a Dataset can't be used as the aggregation value inside an RDD operation: it gets serialized out to the executors, where the SparkSession behind it is null, which is most likely where your NullPointerException comes from.

Remove .toDS and run the aggregation on the plain RDD. You can convert the result into a Dataset after the aggregation is done (not sure if it would be any better in performance).
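
Something like this minimal sketch (assuming rdd is an RDD[(String, Food)] and spark is your SparkSession):

// aggregate on the RDD, collecting the Foods for each key into a Seq
val aggregated = rdd.aggregateByKey(Seq.empty[Food])(
  (acc, f) => acc :+ f,   // add one Food to the per-key Seq within a partition
  (a1, a2) => a1 ++ a2    // merge partial Seqs across partitions
)

// only now turn the result into a Dataset, on the driver
import spark.implicits._
val foodsByKeyDS = aggregated.toDS()

This keeps all Dataset creation on the driver, so nothing Dataset-related gets serialized to the executors.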
