I have been trying to solve this problem but can't map it onto any standard technique. I have the following dataset:
[
{"name": "sam", "hobbies": ["Books", "Music", "Gym"]},
{"name": "Steve", "hobbies": ["Books", "Swimming"]},
{"name": "Alex", "hobbies": ["Gym", "Music"]}
]
I am trying to generate an output dataset that groups people by hobbies, so that hobbies shared by the same set of people end up in one row. The output should look something like this:
[
{"names": ["sam", "Steve"], "hobbies": ["Books"]},
{"names": ["sam", "Alex"], "hobbies": ["Music", "Gym"]},
{"names": ["Steve"], "hobbies": ["Swimming"]}
]
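To make the rule concrete, here is a single-machine Python sketch of what I think the grouping is. It assumes that two hobbies belong in the same row exactly when they are held by the same set of people. The function name `group_by_shared_hobbies` is made up for illustration, and this is plain Python, not Spark code; it only pins down the expected output on the sample.

```python
from collections import defaultdict

# Sample input from the question.
people = [
    {"name": "sam", "hobbies": ["Books", "Music", "Gym"]},
    {"name": "Steve", "hobbies": ["Books", "Swimming"]},
    {"name": "Alex", "hobbies": ["Gym", "Music"]},
]


def group_by_shared_hobbies(records):
    """Hypothetical helper: merge hobbies that have an identical set of people."""
    # Step 1: invert the data into hobby -> set of people who have it.
    names_by_hobby = defaultdict(set)
    for record in records:
        for hobby in record["hobbies"]:
            names_by_hobby[hobby].add(record["name"])

    # Step 2: group hobbies by their (hashable) set of people.
    hobbies_by_names = defaultdict(list)
    for hobby, names in names_by_hobby.items():
        hobbies_by_names[frozenset(names)].append(hobby)

    # Sort inside each row only to make the output reproducible.
    return [
        {"names": sorted(names), "hobbies": sorted(hobbies)}
        for names, hobbies in hobbies_by_names.items()
    ]


result = group_by_shared_hobbies(people)
for row in result:
    print(row)
```

On the sample this yields the same three groups as the expected output above, although the order of rows and of the names inside a row may differ. A distributed version would need to reproduce the same two grouping steps.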
It's a large dataset, so I was trying to use Spark.
Things I have tried:
Initially I was trying to see if it's a graph problem where I could use something like strongly connected components, but it looks like that won't solve the problem.
Each output row looks like a bipartite graph, but I was not able to find a way to generate that either.
Another approach was clustering, but I thought it would not be deterministic. Please let me know if I am wrong; I am not too familiar with it.
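Going back to the graph idea: the sketch below shows why a components approach seemed to fail on the sample. It treats people and hobbies as nodes of an undirected graph, so it computes plain connected components with a small union-find rather than strongly connected components (that substitution is my assumption). The helper name `connected_components` and the `("person", ...)` / `("hobby", ...)` node labels are made up for illustration.

```python
# Sample input from the question.
people = [
    {"name": "sam", "hobbies": ["Books", "Music", "Gym"]},
    {"name": "Steve", "hobbies": ["Books", "Swimming"]},
    {"name": "Alex", "hobbies": ["Gym", "Music"]},
]


def connected_components(records):
    """Hypothetical helper: components of the person-hobby graph via union-find."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every person to each of their hobbies.
    for record in records:
        person = ("person", record["name"])
        find(person)
        for hobby in record["hobbies"]:
            union(person, ("hobby", hobby))

    # Collect nodes under their root.
    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())


components = connected_components(people)
print(len(components))
```

Because sam shares Books with Steve and Music with Alex, everything is linked together and the whole sample collapses into a single component, which loses the per-hobby grouping I am after.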
Let me know if I am missing something obvious here. Thanks.