
I have been trying to solve this problem but can't map it onto any known solution. I have the following dataset:

[
  {"name": "sam", "hobbies": ["Books", "Music", "Gym"]},
  {"name": "Steve", "hobbies": ["Books", "Swimming"]},
  {"name": "Alex", "hobbies": ["Gym", "Music"]}
]

I am trying to generate an output dataset that groups people by the hobbies they share. The output should look something like this:

[
  {"names": ["sam", "Steve"], "hobbies": ["Books"]},
  {"names": ["sam", "Alex"], "hobbies": ["Music", "Gym"]},
  {"names": ["Steve"], "hobbies": ["Swimming"]}
]

It's a large dataset, so I was trying to use Spark.

Things I have tried:

  • Initially I wondered whether it's a graph problem that I could solve with something like strongly connected components, but it looks like that won't work.

  • Each output row looks like a bipartite graph, but I couldn't find a way to generate that either.

  • Another approach was clustering, but I thought it would not be deterministic. Please let me know if I am wrong; I am not too familiar with it.

Let me know if I am missing something obvious here. Thanks.

webdev
  • You can `explode` hobbies and then use `groupBy` & `collect_list` on the name & hobbies columns. – Srinivas Nov 13 '20 at 11:13
  • @Srinivas yes but then how do I find partial lists that are shared hobbies of names? – webdev Nov 13 '20 at 16:24

1 Answer


See the code below.

scala> df.show(false)
+-------------------+-----+
|hobbies            |name |
+-------------------+-----+
|[Books, Music, Gym]|sam  |
|[Books, Swimming]  |Steve|
|[Gym, Music]       |Alex |
+-------------------+-----+

Use groupBy & collect_list

  1. Group By hobbies & Collect List of names
  2. Group By names & Collect List of hobbies
scala> :paste
// Entering paste mode (ctrl-D to finish)

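import org.apache.spark.sql.functions._ // explode, collect_list, to_json, struct

df
.withColumn("hobbies",explode($"hobbies")) // One row per (name, hobby) pair
.groupBy($"hobbies").agg(collect_list($"name").as("names")) // For each hobby, the list of names
.groupBy($"names").agg(collect_list($"hobbies").as("hobbies")) // For each list of names, the hobbies they share
.select(collect_list(to_json(struct($"hobbies",$"names"))).as("data")) // Final Json Output
.show(false)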


// Exiting paste mode, now interpreting.

+--------------------------------------------------------------------------------------------------------------------------------------------+
|data                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{"hobbies":["Swimming"],"names":["Steve"]}, {"hobbies":["Books"],"names":["sam","Steve"]}, {"hobbies":["Music","Gym"],"names":["sam","Alex"]}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+

Formatted Output

[
  {"hobbies": ["Swimming"], "names": ["Steve"]},
  {"hobbies": ["Books"], "names": ["sam", "Steve"]},
  {"hobbies": ["Music", "Gym"], "names": ["sam", "Alex"]}
]
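
To check the logic on a small sample without a Spark session, the same two grouping steps can be sketched with plain Scala collections. This is only an illustration, not the Spark code itself: `HobbyGroups`, `Person`, `Group` and `groupByHobbies` are hypothetical names introduced here. The sketch also sorts and de-duplicates the collected names, on the assumption that two hobbies should fall into the same group only when they have exactly the same set of people, regardless of the order in which the names were collected.

```scala
object HobbyGroups {
  // Hypothetical record types, introduced only for this illustration.
  final case class Person(name: String, hobbies: Seq[String])
  final case class Group(names: Seq[String], hobbies: Seq[String])

  def groupByHobbies(people: Seq[Person]): Seq[Group] = {
    // "Explode": one (hobby, name) pair per hobby of each person.
    val pairs = for (p <- people; h <- p.hobbies) yield (h, p.name)

    // Step 1: group by hobby and collect the names.
    // Sorting makes the collected list independent of input order.
    val namesByHobby: Map[String, Seq[String]] =
      pairs.groupBy(_._1).map { case (hobby, hs) =>
        hobby -> hs.map(_._2).distinct.sorted
      }

    // Step 2: group by the collected names and collect the hobbies.
    namesByHobby.toSeq
      .groupBy { case (_, names) => names }
      .map { case (names, hs) => Group(names, hs.map(_._1).sorted) }
      .toSeq
      .sortBy(_.hobbies.head) // stable output order for display
  }

  def main(args: Array[String]): Unit = {
    val people = Seq(
      Person("sam", Seq("Books", "Music", "Gym")),
      Person("Steve", Seq("Books", "Swimming")),
      Person("Alex", Seq("Gym", "Music"))
    )
    groupByHobbies(people).foreach(println)
  }
}
```

On the sample data this yields three groups: Books for Steve and sam, Gym and Music for Alex and sam, and Swimming for Steve alone. In Spark, `collect_list` does not guarantee element order after a shuffle, so wrapping the names in `sort_array` before the second `groupBy` is one way to get the same order-independence there.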
Srinivas
  • I am assuming the second groupBy is on $"names", agg output of first groupBy. I think this will work. I will try it. Thanks – webdev Nov 13 '20 at 19:00