
I was wondering if there is a way to specify a custom aggregation function for Spark DataFrames. If I have a table with two columns, id and value, I would like to groupBy id and aggregate the values into a list for each id, like so:

from:

john | tomato
john | carrot
bill | apple
john | banana
bill | taco

to:

john | tomato, carrot, banana
bill | apple, taco

Is this possible with DataFrames? I am asking about DataFrames because I am reading the data from an ORC file and it is loaded as a DataFrame. I would think it would be inefficient to convert it to an RDD.

anthonybell

2 Answers


I'd simply go with the following:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // needed for .toDF and the $"..." column syntax outside the shell (Spark 2.x SparkSession)
val df = Seq(("john", "tomato"), ("john", "carrot"), 
             ("bill", "apple"), ("john", "banana"), 
             ("bill", "taco")).toDF("id", "value")
// df: org.apache.spark.sql.DataFrame = [id: string, value: string]

val aggDf = df.groupBy($"id").agg(collect_list($"value").as("values"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, values: array<string>]

aggDf.show(false)
// +----+------------------------+
// |id  |values                  |
// +----+------------------------+
// |john|[tomato, carrot, banana]|
// |bill|[apple, taco]           |
// +----+------------------------+
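
If you want the comma-separated string from the question rather than an array column, concat_ws can flatten the collected array into a single string. A small sketch on top of aggDf (the values_csv name is just illustrative, and row order in the show output may differ):

import org.apache.spark.sql.functions.concat_ws

val csvDf = aggDf.select($"id", concat_ws(", ", $"values").as("values_csv"))
// csvDf: org.apache.spark.sql.DataFrame = [id: string, values_csv: string]

csvDf.show(false)
// +----+----------------------+
// |id  |values_csv            |
// +----+----------------------+
// |john|tomato, carrot, banana|
// |bill|apple, taco           |
// +----+----------------------+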

You won't even need to call the underlying rdd.
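
And if you really do need custom aggregation logic rather than the built-in collect_list, DataFrames also accept user-defined aggregate functions. A rough sketch of a UDAF that does the same collection (Spark 1.5–2.x UserDefinedAggregateFunction API; the CollectToList name is only illustrative, and collect_list remains the simpler choice here):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectToList extends UserDefinedAggregateFunction {
  // one string input column
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  // the buffer holds the values collected so far
  def bufferSchema: StructType = StructType(StructField("values", ArrayType(StringType)) :: Nil)
  // the final result is an array of strings
  def dataType: DataType = ArrayType(StringType)
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[String]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[String](0) :+ input.getString(0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)

  def evaluate(buffer: Row): Any = buffer.getSeq[String](0)
}

val collectToList = new CollectToList
df.groupBy($"id").agg(collectToList($"value").as("values"))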

eliasah

Reverting to RDD operations tends to work best for problems like this:

scala> val df = sc.parallelize(Seq(("john", "tomato"),
           ("john", "carrot"), ("bill", "apple"), 
           ("john", "bannana"), ("bill", "taco")))
           .toDF("name", "food")
df: org.apache.spark.sql.DataFrame = [name: string, food: string]

scala> df.show
+----+-------+
|name|   food|
+----+-------+
|john| tomato|
|john| carrot|
|bill|  apple|
|john| banana|
|bill|   taco|
+----+-------+

scala> val aggregated = df.rdd
           .map{ case Row(k: String, v: String) => (k, List(v)) }
           .reduceByKey{_ ++ _}
           .toDF("name", "foods")
aggregated: org.apache.spark.sql.DataFrame = [name: string, foods: array<string>]

scala> aggregated.collect.foreach{println}
[john,WrappedArray(tomato, carrot, banana)]
[bill,WrappedArray(apple, taco)]

As for efficiency, I believe DataFrames are RDDs under the hood so a conversion like .rdd has very little cost.
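
One caveat, though: reduceByKey{_ ++ _} builds a new List on every concatenation, so for large groups it can be cheaper to accumulate into a mutable buffer with aggregateByKey. A sketch of the same grouping done that way (the aggregated2 name is just illustrative):

import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer

val aggregated2 = df.rdd
  .map { case Row(k: String, v: String) => (k, v) }
  .aggregateByKey(ArrayBuffer.empty[String])(
    (buf, v) => buf += v,    // append within a partition
    (b1, b2) => b1 ++= b2)   // merge per-partition buffers
  .mapValues(_.toSeq)
  .toDF("name", "foods")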

evan.oman