Here is the code I have so far:
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// define a case class
case class Zone(id: Int, team: String, members: Int, name: String, lastname: String)

val df = Seq(
  (1, "team1", 3, "Jonh", "Doe"),
  (1, "team2", 4, "Jonh", "Doe"),
  (1, "team3", 5, "David", "Luis"),
  (2, "team4", 6, "Michael", "Larson"))
  .toDF("id", "team", "members", "name", "lastname").as[Zone]

// serialize the team and user fields of each row to JSON, then collect them per id
val df_grouped = df
  .withColumn("team_info", to_json(struct(col("team"), col("members"))))
  .withColumn("users", to_json(struct(col("name"), col("lastname"))))
  .groupBy("id")
  .agg(collect_list($"team_info").alias("team_info"), collect_list($"users").alias("users"))

df_grouped.show
+---+--------------------+--------------------+
| id|           team_info|               users|
+---+--------------------+--------------------+
|  1|[{"team":"team1",...|[{"name":"Jonh","...|
|  2|[{"team":"team4",...|[{"name":"Michael...|
+---+--------------------+--------------------+
I need to remove the duplicates inside the "users" column. In my case, two JSON strings in the array count as duplicates when they are exactly the same. Is there a way to do this by changing the value of that column with df.withColumn, or with any other approach?
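
For reference, one direction that might work is sketched below. It assumes Spark 2.4 or later, where `array_distinct` is available in `org.apache.spark.sql.functions`, and it builds its own local `SparkSession` so the snippet is self-contained. The names `deduped`, `users1`, and `users2` are only illustrative. Because the array elements are JSON strings, two entries are treated as duplicates only when the strings match exactly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_distinct, col, collect_list, struct, to_json}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("dedup-users").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "team1", 3, "Jonh", "Doe"),
  (1, "team2", 4, "Jonh", "Doe"),
  (1, "team3", 5, "David", "Luis"),
  (2, "team4", 6, "Michael", "Larson"))
  .toDF("id", "team", "members", "name", "lastname")

val deduped = df
  .withColumn("team_info", to_json(struct(col("team"), col("members"))))
  .withColumn("users", to_json(struct(col("name"), col("lastname"))))
  .groupBy("id")
  .agg(
    collect_list(col("team_info")).alias("team_info"),
    collect_list(col("users")).alias("users"))
  // array_distinct removes repeated elements from the array column
  .withColumn("users", array_distinct(col("users")))

// truncate = false prints the full JSON strings instead of cutting them at 20 characters
deduped.show(false)
```

With the sample data, the "users" array for id 1 would hold two entries instead of three. Another option, if the array is only needed without duplicates, would be to aggregate with `collect_set` instead of `collect_list`; neither function guarantees element order after the shuffle.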