I was wondering if there is some way to specify a custom aggregation function for Spark DataFrames. If I have a table with two columns, `id` and `value`, I would like to `groupBy` on `id` and aggregate the values into a list per `id`, like so:
from:
john | tomato
john | carrot
bill | apple
john | banana
bill | taco
to:
john | tomato, carrot, banana
bill | apple, taco
Is this possible with DataFrames? I am asking about DataFrames specifically because I am reading the data from an ORC file, so it is loaded as a DataFrame, and I would think it inefficient to convert it to an RDD just for this.
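To pin down exactly what I mean, here is the same aggregation written in plain Python (no Spark involved, and the variable names are made up for illustration):

```python
from collections import defaultdict

# Input rows as (id, value) pairs, mirroring the table above.
rows = [
    ("john", "tomato"),
    ("john", "carrot"),
    ("bill", "apple"),
    ("john", "banana"),
    ("bill", "taco"),
]

# Collect the values into a list per id, preserving encounter order.
grouped = defaultdict(list)
for key, value in rows:
    grouped[key].append(value)

for key, values in grouped.items():
    print(f"{key} | {', '.join(values)}")
# john | tomato, carrot, banana
# bill | apple, taco
```

I am looking for the idiomatic DataFrame equivalent of this grouping.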