My response will be algebraic rather than Spark/Python specific, but it can be implemented in Spark.
How can we express the data in your problem?
I will go with a matrix - each row represents a person, each column represents an interest. Following your example:
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 0 1 1 0 1
P5: 1 1 0 0 0 1
What if we would like to investigate the similarity of P2 and P3 - that is, check how many interests they share? We could use the following formula:
(movies)+(sports)+(trekking)+(reading)+(sleeping)+(dramas)
1*1 + 1*0 + 1*1 + 1*0 + 1*0 + 1*0 = 2
It may look familiar to you - it looks like part of a matrix multiplication.
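For illustration, here is a small NumPy sketch (names are my own) that encodes the table above and computes that dot product:

    import numpy as np

    # Rows: P1..P5, columns: movies, sports, trekking, reading, sleeping, dramas
    M = np.array([[1, 1, 0, 0, 0, 1],   # P1
                  [1, 1, 1, 1, 1, 1],   # P2
                  [1, 0, 1, 0, 0, 0],   # P3
                  [0, 0, 1, 1, 0, 1],   # P4
                  [1, 1, 0, 0, 0, 1]])  # P5

    # Shared interests of P2 and P3 = dot product of their rows
    print(M[1].dot(M[2]))  # 2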
To make full use of this observation, we have to transpose the matrix - it will look like this:
P1,P2,P3,P4,P5
movies 1 1 1 0 1
sports 1 1 0 0 1
trekking 0 1 1 1 0
reading 0 1 0 1 0
sleeping 0 1 0 0 0
dramas 1 1 0 1 1
Now if we multiply the matrices (the original by the transposed one), we get a new matrix:
P1 P2 P3 P4 P5
P1 3 3 1 1 3
P2 3 6 2 3 3
P3 1 2 2 1 1
P4 1 3 1 2 1
P5 3 3 1 1 3
What you see here is the result you are looking for - check the value at the row/column crossing and you will get the number of shared interests.
- How many interests does P2 share with P4? Answer: 3
- Who shares 3 interests with P1? Answer: P2 and P5
- Who shares 2 interests with P3? Answer: P2
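As a quick sanity check, the whole multiplication can be done with NumPy (again just a sketch, with the same hard-coded matrix as above):

    import numpy as np

    M = np.array([[1, 1, 0, 0, 0, 1],   # P1
                  [1, 1, 1, 1, 1, 1],   # P2
                  [1, 0, 1, 0, 0, 0],   # P3
                  [0, 0, 1, 1, 0, 1],   # P4
                  [1, 1, 0, 0, 0, 1]])  # P5

    shared = M.dot(M.T)    # entry (i, j) = interests Pi and Pj have in common
    print(shared[1, 3])    # P2 and P4 share 3 interests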
Some hints on how to apply this idea in Apache Spark:
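One option is to use the distributed matrix types from pyspark.mllib. A minimal sketch, assuming an existing SparkContext named sc; the entries are hard-coded here for brevity, but in practice you would build them from your data:

    from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

    # One MatrixEntry(person_index, hobby_index, 1) per interest a person has
    entries = sc.parallelize([
        MatrixEntry(0, 0, 1), MatrixEntry(0, 1, 1), MatrixEntry(0, 5, 1),   # P1
        MatrixEntry(1, 0, 1), MatrixEntry(1, 1, 1), MatrixEntry(1, 2, 1),
        MatrixEntry(1, 3, 1), MatrixEntry(1, 4, 1), MatrixEntry(1, 5, 1),   # P2
        MatrixEntry(2, 0, 1), MatrixEntry(2, 2, 1),                         # P3
        MatrixEntry(3, 2, 1), MatrixEntry(3, 3, 1), MatrixEntry(3, 5, 1),   # P4
        MatrixEntry(4, 0, 1), MatrixEntry(4, 1, 1), MatrixEntry(4, 5, 1),   # P5
    ])

    mat = CoordinateMatrix(entries).toBlockMatrix()

    # shared[i, j] = number of interests persons i and j have in common
    shared = mat.multiply(mat.transpose())
    print(shared.toLocalMatrix())

For a large number of users you would keep the result distributed (e.g. via shared.toCoordinateMatrix().entries) instead of collecting it to the driver.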
EDIT 1: Adding a more realistic method (after the comments)
We have a table/RDD/Dataset "UserHobby":
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 0 1 1 0 1
P5: 1 1 0 0 0 1
Now, to find all the people that share exactly 2 interests with P1, you would have to execute:
SELECT * FROM UserHobby
WHERE movies*1 + sports*1 + trekking*0 +
      reading*0 + sleeping*0 + dramas*1 = 2
Now you would have to repeat this query for all the users (changing the 0s and 1s to that user's actual values). The algorithm complexity is O(n^2 * m), where n is the number of users and m the number of hobbies.
What is nice about this method is that you don't have to generate subsets.
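If you go this route, the per-user queries can be generated programmatically. A rough PySpark sketch, assuming a SparkSession named spark and that UserHobby also carries a person column identifying each row (all names here are illustrative):

    hobbies = ["movies", "sports", "trekking", "reading", "sleeping", "dramas"]

    df = spark.table("UserHobby")      # person plus one 0/1 column per hobby

    for row in df.collect():           # one query per user -> O(n^2 * m) overall
        weights = " + ".join("{0}*{1}".format(h, row[h]) for h in hobbies)
        query = ("SELECT person FROM UserHobby "
                 "WHERE {0} = 2 AND person <> '{1}'").format(weights, row["person"])
        print(row["person"], spark.sql(query).collect())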