I have the following code and output, adapted from Aggregating multiple columns with custom function in Spark.
import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))
  .show(false)

df.printSchema
Output and Schema:
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]] |
+----+---------------------------------------------+
root
|-- name: string (nullable = true)
|-- foods: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- food: string (nullable = true)
| | |-- price: double (nullable = false)
I want to sort the foods array based on df("foods.food"). How do I sort it to get the output below?
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[banana,1.29], [carrot,0.45], [tomato,1.99]]|
|bill|[[apple,0.99], [taco,2.59]] |
+----+---------------------------------------------+
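One approach that seems to produce this ordering is sort_array, which, as I understand it, compares structs field by field from left to right, so with food as the first struct field each collected list comes back sorted by food name:

import org.apache.spark.sql.functions.sort_array

// sort_array compares structs by their fields in order, so with
// food first in the struct each collected list is sorted by food
df.groupBy($"name")
  .agg(sort_array(collect_list(struct($"food", $"price"))).as("foods"))
  .show(false)

But this only works because food happens to be the first field in the struct.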
Edit: I want to be able to choose which attribute to sort on. For example, if I want to sort on price instead, the output should look like this:
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[carrot,0.45], [banana,1.29], [tomato,1.99]]|
|bill|[[apple,0.99], [taco,2.59]] |
+----+---------------------------------------------+
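The closest I've gotten for an arbitrary sort key is a UDF that sorts the collected rows in plain Scala (sortByPrice is just my name for it, and the resulting struct fields come back as _1/_2 rather than food/price), but I'm not sure this is the idiomatic way:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// sorts each collected list of (food, price) structs by price;
// swap the sortBy key to sort on food instead
val sortByPrice = udf { foods: Seq[Row] =>
  foods
    .map { case Row(food: String, price: Double) => (food, price) }
    .sortBy { case (_, price) => price }
}

df.groupBy($"name")
  .agg(sortByPrice(collect_list(struct($"food", $"price"))).as("foods"))
  .show(false)

Is there a cleaner way to pick the sort attribute without losing the original field names?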