I'm not sure how to approach this problem. I've tried K-Means, but it doesn't seem to be working.

I want to match each person in Set A with a person in Set B based on their interests.

Each dataframe has a person ID column and 35 interest columns. The interest choices are the same for both sets.

Interests are rated on a scale of 1-5, with 1 being the top interest and 5 the lowest. Each person can choose only 5 interests in total, so any interest not chosen is denoted with a 0.

Example:

    >>> dfA
    Id     interest1     interest2     interest3  ..  interest35
    A1         1             4             2              0
    A2         0             0             0              0
    A3         5             2             0              0

    >>> dfB
    Id     interest1     interest2     interest3  ..  interest35
    B1         1             4             2              0
    B2         0             0             0              0
    B3         5             2             0              0

I want an algorithm that creates a new table matching each person to their most similar counterpart, e.g.:

    SetA ID     SetB ID
      A1           B2
      A2           B4
      A3           B72

My first problem: what format should my data be in? Should I reverse the preference notation (i.e. have 5 = most interested)? That seems to make more sense, since 0 means no interest. Is this needed? Currently, I have just appended dfB to the bottom of dfA and run a K-Means fit on that.
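To make the reversal concrete, here is a minimal sketch of what I mean, assuming pandas and a toy frame with only 3 of the 35 interest columns. The helper name `reverse_ratings` is made up for illustration. It maps 1..5 to 5..1 and leaves 0 (not chosen) untouched.

```python
import pandas as pd

# Hypothetical toy frame with the same layout as dfA (3 of the 35 columns).
dfA = pd.DataFrame(
    {"interest1": [1, 0, 5], "interest2": [4, 0, 2], "interest3": [2, 0, 0]},
    index=["A1", "A2", "A3"],
)


def reverse_ratings(df, max_rating=5):
    """Map 1..max_rating to max_rating..1 and keep 0 (not chosen) as 0."""
    # DataFrame.where keeps values where the condition is True (the zeros)
    # and substitutes the second argument everywhere else.
    return df.where(df == 0, max_rating + 1 - df)


dfA_rev = reverse_ratings(dfA)
print(dfA_rev)
```

After this step a larger number always means more interest, so 0 sits naturally at the bottom of the scale.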

My second problem: which algorithm and parameters should I use? I've read about using 'cosine' for calculating distances. In particular, I want the algorithm to put a higher weight on interests that are less common (e.g. if only 2 people are interested in 'Interest4', weight this interest highly and weight very common interests lower. In this case, these 2 people should be matched with each other over anyone else).
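A rough sketch of the kind of thing I have in mind, assuming NumPy, ratings already reversed so that 5 = top interest, and made-up toy data with 4 interests instead of 35. The IDF-style weighting (log of people over people who chose the interest) is my own assumption for "rarer interests count more", and the function names are hypothetical.

```python
import numpy as np


def rarity_weights(ratings):
    """IDF-style weight per interest: the fewer people chose it, the larger the weight."""
    n_people = ratings.shape[0]
    n_chosen = (ratings > 0).sum(axis=0)  # people who picked each interest
    return np.log((1 + n_people) / (1 + n_chosen)) + 1.0  # smoothed, always >= 1


def weighted_cosine(a, b, w):
    """Cosine similarity of every row of a against every row of b, columns scaled by w."""
    aw, bw = a * w, b * w
    a_norm = np.linalg.norm(aw, axis=1, keepdims=True)
    b_norm = np.linalg.norm(bw, axis=1, keepdims=True)
    # People with no interests at all have zero norm; leave their rows as zeros.
    a_unit = np.divide(aw, a_norm, out=np.zeros_like(aw), where=a_norm > 0)
    b_unit = np.divide(bw, b_norm, out=np.zeros_like(bw), where=b_norm > 0)
    return a_unit @ b_unit.T


# Toy data (reversed scale). The last interest is rare, the first one is common.
ids_a, ids_b = ["A1", "A2", "A3"], ["B1", "B2", "B3"]
A = np.array([[5, 0, 0, 4], [5, 3, 0, 0], [0, 0, 5, 0]])
B = np.array([[5, 3, 0, 0], [0, 0, 4, 0], [3, 0, 0, 5]])

weights = rarity_weights(np.vstack([A, B]))  # rarity measured over both sets
sims = weighted_cosine(A, B, weights)
best = sims.argmax(axis=1)  # index of the most similar B for each A

for a_id, j in zip(ids_a, best):
    print(a_id, ids_b[j])
```

Note that taking the row-wise maximum lets several people in Set A share the same match in Set B. If every person must be used at most once, this becomes an assignment problem, and something like `scipy.optimize.linear_sum_assignment` on the negated similarity matrix might be a better fit.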

desertnaut
chillingfox
  • That's really not what he/she is asking about @Chris – James Jun 11 '20 at 11:48
  • With ordinal data, yes, you should probably reverse the interest level to have 5 be the most interest if you have set 0 to mean no interest. K-means clustering in 35 dimensions when most of the dimensions are 0 is probably not going to have stellar results. – James Jun 11 '20 at 11:56
  • @Chris Giving specific features more weight definitely solves one of my problems - thanks!! – chillingfox Jun 11 '20 at 15:20
  • @James I've reversed the interest level now. I think I want to use KNN to make the table with matches of each person - anything I can read for this? – chillingfox Jun 11 '20 at 15:21
