
I have the following code and output from Aggregating multiple columns with custom function in Spark.

import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

val result = df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))

result.show(false)
result.printSchema

Output and Schema:

+----+---------------------------------------------+
|name|foods                                        |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]]                  |
+----+---------------------------------------------+

root
 |-- name: string (nullable = true)
 |-- foods: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- food: string (nullable = true)
 |    |    |-- price: double (nullable = false)

I want to sort based on df("foods.food"). How do I sort this to get the output below?

+----+---------------------------------------------+
|name|foods                                        |
+----+---------------------------------------------+
|john|[[banana,1.29], [carrot,0.45], [tomato,1.99]]|
|bill|[[apple,0.99], [taco,2.59]]                  |
+----+---------------------------------------------+

Edit: I want to be able to choose which attribute to sort on. For example, if I want to sort on the price, I want an output like this:

+----+---------------------------------------------+
|name|foods                                        |
+----+---------------------------------------------+
|john|[[carrot,0.45], [banana,1.29], [tomato,1.99]]|
|bill|[[apple,0.99], [taco,2.59]]                  |
+----+---------------------------------------------+
  • Use `array_sort` with a custom comparator function, see this [post](https://stackoverflow.com/a/71109379/1386551) – blackbishop Feb 14 '22 at 09:12
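
A minimal sketch of the array_sort-with-comparator approach suggested in the comment above, assuming Spark 3.0+ (SQL lambda syntax) and the aggregated result built as in the question:

import org.apache.spark.sql.functions.{collect_list, expr, struct}

// Build the aggregated frame as in the question.
val result = df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))

// Sort each foods array by the price field using a custom comparator (Spark 3.0+ SQL lambda).
result
  .withColumn("foods", expr(
    "array_sort(foods, (l, r) -> case when l.price < r.price then -1 when l.price > r.price then 1 else 0 end)"))
  .show(false)

Swapping l.price/r.price for l.food/r.food sorts by food instead, so the same pattern works for whichever attribute you choose.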

1 Answer


You can use the sort_array function (see the Spark SQL functions documentation). Without a comparator it uses the struct's natural ordering, comparing fields left to right, so with struct($"food", $"price") it sorts by food.

import org.apache.spark.sql.functions.sort_array

df.groupBy($"name")
  .agg(sort_array(collect_list(struct($"food", $"price"))).as("foods"))
  .show(false)
过过招
  • thanks! but what if I want to sort based on the price attribute? I also want to be able to choose which attribute to sort on – Illustrious Imp Feb 14 '22 at 01:56
  • Does this answer your question? https://stackoverflow.com/questions/49671354/how-to-sort-array-of-struct-type-in-spark-dataframe-by-particular-column. Another way: construct the `struct` with `price` first: struct($"price", $"food"). – 过过招 Feb 14 '22 at 02:23