
I am working with PySpark on a case where I need to group people based on their interests. Let's say I have n persons:

person1, movies, sports, dramas
person2, sports, trekking, reading, sleeping, movies, dramas
person3, movies, trekking
person4, reading, trekking, sports
person5, movies, sports, dramas
.
.
.

Now I want to do two things with this data:

  1. Group people who have at least m common interests (m is user input, it could be 2, 3, 4...)

    Let's assume m=3  
    Then the groups are:
          (person1, person2, person5)
          (person2, person4)
    
  2. Find users who belong to x groups (x is user input)

    Let's assume x=2
    Then
      person2 is in two groups
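
For concreteness, the input could be set up like this (a minimal, illustrative sketch):

    # Illustrative setup: the sample data as an RDD of (person, interests)
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "interest-groups")
    people = sc.parallelize([
        ("person1", ["movies", "sports", "dramas"]),
        ("person2", ["sports", "trekking", "reading", "sleeping", "movies", "dramas"]),
        ("person3", ["movies", "trekking"]),
        ("person4", ["reading", "trekking", "sports"]),
        ("person5", ["movies", "sports", "dramas"]),
    ])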
    

3 Answers


My response will be algebraic rather than Spark/Python specific, but it can be implemented in Spark.

How can we express the data in your problem? I will go with a matrix: each row represents a person, each column an interest. So, following your example:

    movies,sports,trekking,reading,sleeping,dramas
P1: 1  1  0  0  0  1
P2: 1  1  1  1  1  1
P3: 1  0  1  0  0  0
P4: 0  1  1  1  0  0
P5: 1  1  0  0  0  1

What if we would like to investigate the similarity of P2 and P3, i.e. check how many interests they share? We could use the following formula:

(movies)+(sports)+(trekking)+(reading)+(sleeping)+(dramas)
   1*1  +  1*0   +    1*1   +  1*0    +   1*0    +   1*0  = 2

It may look familiar to you - it is exactly the calculation performed for one entry of a matrix multiplication.

To take full advantage of this observation, we have to transpose the matrix - it will look like this:

         P1,P2,P3,P4,P5
movies   1  1  1  0  1
sports   1  1  0  1  1
trekking 0  1  1  1  0
reading  0  1  0  1  0
sleeping 0  1  0  0  0
dramas   1  1  0  0  1

Now if we multiply the two matrices (the original times its transpose) we get a new matrix:

    P1  P2  P3  P4  P5
P1  3   3   1   1   3
P2  3   6   2   3   3
P3  1   2   2   1   1
P4  1   3   1   3   1
P5  3   3   1   1   3

What you see here is the result you are looking for - check the value at the row/column intersection and you get the number of shared interests (the diagonal is each person's own interest count).

  • How many interests do P2 share with P4? Answer: 3
  • Who shares 3 interests with P1? Answer: P2 and P5
  • Who shares 2 interests with P3? Answer: P2
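
This is easy to verify numerically; a quick sketch with numpy (not part of the original answer):

    # Quick numerical check of M * M^T for the example matrix
    import numpy as np

    M = np.array([[1, 1, 0, 0, 0, 1],   # P1
                  [1, 1, 1, 1, 1, 1],   # P2
                  [1, 0, 1, 0, 0, 0],   # P3
                  [0, 1, 1, 1, 0, 0],   # P4
                  [1, 1, 0, 0, 0, 1]])  # P5

    print(M @ M.T)  # entry (i, j) = number of interests Pi shares with Pj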

Some hints on how to apply this idea in Apache Spark:
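
For instance, a minimal PySpark sketch (illustrative, not from the original answer) that computes the non-zero entries of the product matrix without materializing it, assuming an RDD `people` of (person, list-of-interests) pairs as in the question:

    # Sparse version of M * M^T: emit (interest, person) pairs, self-join on
    # the interest key, then count shared interests per pair of persons.
    by_interest = people.flatMap(
        lambda row: [(interest, row[0]) for interest in row[1]])

    pair_counts = (by_interest.join(by_interest)
                   .filter(lambda kv: kv[1][0] < kv[1][1])  # each pair once
                   .map(lambda kv: (kv[1], 1))
                   .reduceByKey(lambda a, b: a + b))

    print(pair_counts.collect())
    # e.g. [(('person1', 'person2'), 3), (('person2', 'person4'), 3), ...]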

EDIT 1: Adding a more realistic method (after the comments)

We have a table/RDD/Dataset "UserHobby":

    movies,sports,trekking,reading,sleeping,dramas
P1: 1  1  0  0  0  1
P2: 1  1  1  1  1  1
P3: 1  0  1  0  0  0
P4: 0  1  1  1  0  0
P5: 1  1  0  0  0  1

Now to find all the people that share exactly 2 interests with P1 (use >= for "at least"), you would have to execute:

SELECT * FROM UserHobby
    WHERE movies*1 + sports*1 + trekking*0 +
          reading*0 + sleeping*0 + dramas*1 = 2

Now you would have to repeat this query for all the users (changing the 0s and 1s to each user's actual values). The algorithm's complexity is O(n^2 * m), where n is the number of users and m is the number of hobbies.

What is nice about this method is that you don't have to generate subsets.
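
As a rough illustration of how this could look in Spark SQL (hypothetical names; assumes the 0/1 table is registered as a temp view `user_hobby` and a SparkSession `spark` exists):

    # Build the WHERE expression from one user's 0/1 vector and run it.
    hobbies = ["movies", "sports", "trekking", "reading", "sleeping", "dramas"]
    p1 = {"movies": 1, "sports": 1, "trekking": 0,
          "reading": 0, "sleeping": 0, "dramas": 1}

    expr = " + ".join("%s*%d" % (h, p1[h]) for h in hobbies)
    query = "SELECT * FROM user_hobby WHERE %s = 2" % expr
    matches = spark.sql(query)  # repeat per user for the full O(n^2 * m) pass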

Piotr Reszke
  • What about the case when there are 100k users? How large will your matrix be? – axiom Sep 14 '16 at 20:52
  • @axiom I know this is more of a mathematical than realistic solution. To make it more real you could use the proposed 0/1 matrix representation, and construct a SQL query that would multiply and sum the columns (as proposed above in P2 P3 comparison), but you would have to run this comparison for each row against all the rest of the rows. So: compare P1 with P2, P3..., PN. compare P2 with P1, P3, P4, .. PN and so on. So it would be O(n^2). Anyway I don't think it can be done any faster. – Piotr Reszke Sep 15 '16 at 07:50
  • I like your answer, but as you note, it isn't realistic for practical datasets. It can be made faster, if one can live with approximations. Please see my answer (part on LSH). – axiom Sep 15 '16 at 16:55
  • Ok, I edited/expanded my answer - to make it more realistic and actually this would work in Spark. I will read more about LSH that you mentioned - I'll get back to your answer :) – Piotr Reszke Sep 15 '16 at 17:40

If the number of users is large, you can't realistically go for any User x User approach.

Step 0. As a first step, we should ignore all the users who don't have at least m interests (since they cannot have at least m common interests with anyone).

Step 1. Possible approaches:

i) Brute force: If the maximum number of interests is small, say 10, you can generate all the possible interest combinations in a HashMap and assign an interest group id to each of them. You will need just one pass over the interest set of a user to find out which interest groups they qualify for, which solves the problem in one pass (see the sketch after the LSH outline below).

ii) Locality Sensitive Hashing: After step 0 is done, we know that we only have users with a minimum of m hobbies. If you are fine with an approximate answer, locality sensitive hashing can help. See: How to understand Locality Sensitive Hashing?


A sketch of the LSH approach:

a) First map each of the hobbies to an integer (which you should do anyway if the dataset is large). So we have User -> Set of integers (hobbies)

b) Obtain a signature for each user by min-hashing: apply each of k hash functions to every integer in the set and keep the minimum value per hash function. This gives User -> Signature

c) Explore the users with the same signature in more detail if you want (or stop there with an approximate answer).
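
A toy illustration of steps (a) and (b) in plain Python (the constants, k, and the sample sets are illustrative, not from the answer):

    # Toy min-hash sketch for steps (a)-(b): k random affine hash functions;
    # a user's signature is the tuple of per-function minima over their set.
    import random

    random.seed(0)
    K, PRIME = 4, 2147483647
    params = [(random.randrange(1, PRIME), random.randrange(PRIME))
              for _ in range(K)]

    def signature(hobby_ids):
        return tuple(min((a * x + b) % PRIME for x in hobby_ids)
                     for a, b in params)

    users = {"P1": {0, 1, 5}, "P2": {0, 1, 2, 3, 4, 5}, "P5": {0, 1, 5}}
    sigs = {u: signature(ids) for u, ids in users.items()}
    print(sigs["P1"] == sigs["P5"])  # True - identical sets always collide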

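Going back to approach (i), a rough brute-force sketch (hypothetical names; assumes the number of distinct hobbies is small):

    # Brute-force sketch for (i): enumerate each user's interest subsets of
    # size >= m, give each subset a group id, and bucket users per group.
    from itertools import combinations

    m = 2
    people = {"P1": {"movies", "sports", "dramas"},
              "P3": {"movies", "trekking"}}

    group_ids = {}  # frozenset of hobbies -> interest group id
    members = {}    # interest group id -> users that qualify for it

    for person, hobbies in people.items():
        for size in range(m, len(hobbies) + 1):
            for combo in combinations(sorted(hobbies), size):
                gid = group_ids.setdefault(frozenset(combo), len(group_ids))
                members.setdefault(gid, []).append(person)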

axiom

My answer might not be the best, but it will do the job. If you know the total list of hobbies in advance, you can write a piece of code which computes the combinations before going into the Spark part.

For example:

  • Filter out the people whose hobbies_count < input_number at the very start, to discard unwanted records.
  • If the total list of hobbies is {a,b,c,d,e,f} and the input_number is 2, the list of combinations in this case would be

              {(ab)(ac)(ad)(ae)(af)(bc)(bd)(be)(bf)(cd)(ce)(cf)(de)(df)(ef)}
    
  • So you will need to generate the possible combinations in advance for the given input_number (a sketch follows below).

  • Later, perform a filter for each combination and track the record count.
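
A small sketch of this pre-computation in plain Python (the sample hobby names and people are illustrative):

    # Illustrative sketch: precompute the size-input_number hobby
    # combinations, then check each person against each combination
    # and track the group sizes.
    from itertools import combinations

    hobbies = ["a", "b", "c", "d", "e", "f"]
    input_number = 2
    combos = list(combinations(hobbies, input_number))  # (a,b), (a,c), ...

    people = {"P1": {"a", "b", "c"}, "P2": {"b", "c"}, "P3": {"a", "f"}}
    groups = {c: [p for p, hs in people.items() if set(c) <= hs]
              for c in combos}
    counts = {c: len(ps) for c, ps in groups.items() if len(ps) >= 2}
    print(counts)  # {('b', 'c'): 2}
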
avrsanjay
  • I assumed the number of unique hobbies to be small. If not, this would be way slower than any other method. – avrsanjay Sep 14 '16 at 21:18