I'm not really sure how to go about this problem. I've tried using K-Means but it doesn't seem to be working.
I want to match Person A from Set A with Person B from Set B based on their interests.
Each dataframe has a person ID column and 35 interest columns. The interest choices are the same for both sets.
Interests are rated on a scale of 1-5, with 1 being the top interest and 5 being the lowest. People can only choose 5 interests in total, so any interest not chosen is denoted with a 0.
Example:
>>> dfA
Id  interest1  interest2  interest3  ..  interest35
A1          1          4          2  ..           0
A2          0          0          0  ..           0
A3          5          2          0  ..           0
>>> dfB
Id  interest1  interest2  interest3  ..  interest35
B1          1          4          2  ..           0
B2          0          0          0  ..           0
B3          5          2          0  ..           0
I want an algorithm that creates a new table matching each person in Set A with the most similar person in Set B, e.g.:
SetA ID  SetB ID
A1       B2
A2       B4
A3       B72
My first problem: what format should my data be in? Should I reverse the preference notation (i.e. have 5 = most interested)? That seems to make more sense, since 0 already means no interest. Is this needed? Currently, I have just appended dfB to the bottom of dfA and run a K-Means fit on the combined dataframe.
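To make the reversal concrete, here is a minimal sketch of what it could look like in pandas. It assumes `Id` has been set as the index and uses only three of the 35 interest columns with made-up values; `reverse_ratings` is a hypothetical helper name, not something from a library.

```python
import pandas as pd

# Toy data: 3 of the 35 interest columns, values are illustrative only.
dfA = pd.DataFrame(
    {"interest1": [1, 0, 5], "interest2": [4, 0, 2], "interest3": [2, 0, 0]},
    index=pd.Index(["A1", "A2", "A3"], name="Id"),
)

def reverse_ratings(df: pd.DataFrame, top: int = 5) -> pd.DataFrame:
    """Map 1..top onto top..1 and leave 0 (interest not chosen) as 0."""
    # DataFrame.where keeps values where the condition holds (the zeros)
    # and substitutes `top + 1 - df` everywhere else.
    return df.where(df == 0, top + 1 - df)

dfA_rev = reverse_ratings(dfA)
```

After this step a larger number means a stronger interest and 0 still means "not chosen", so the values grow monotonically with interest, which is what distance and similarity measures implicitly assume.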
My second problem: which algorithm and parameters should I use? I've read about using 'cosine' for calculating distances. In particular, I want the algorithm to put a higher weight on interests that are less common. For example, if only 2 people are interested in 'Interest4', that interest should be weighted highly, while very common interests should be weighted lower. In this case, these 2 people should be matched with each other over others.
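One possible direction, sketched under several assumptions rather than offered as the definitive method: weight each interest by how rare it is (an IDF-style weight borrowed from text retrieval), compute cosine similarity between every Set A and Set B person, and then solve a one-to-one assignment with `scipy.optimize.linear_sum_assignment`. The sketch assumes `Id` is the index, both frames share the same columns, ratings are already oriented so that a higher number means a stronger interest with 0 for "not chosen", and each person should be matched at most once. `match_sets` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

def match_sets(dfA: pd.DataFrame, dfB: pd.DataFrame) -> pd.DataFrame:
    """One-to-one matching by rarity-weighted cosine similarity."""
    both = pd.concat([dfA, dfB])
    # IDF-style weight: the fewer people chose an interest, the larger its weight.
    n_people = len(both)
    n_chosen = (both > 0).sum(axis=0).to_numpy()
    weights = np.log((1 + n_people) / (1 + n_chosen)) + 1.0

    A = dfA.to_numpy(dtype=float) * weights
    B = dfB.to_numpy(dtype=float) * weights

    # Normalise rows to unit length; all-zero rows stay zero (similarity 0).
    norm_a = np.linalg.norm(A, axis=1, keepdims=True)
    norm_b = np.linalg.norm(B, axis=1, keepdims=True)
    A = np.divide(A, norm_a, out=np.zeros_like(A), where=norm_a > 0)
    B = np.divide(B, norm_b, out=np.zeros_like(B), where=norm_b > 0)
    sim = A @ B.T  # cosine similarity, shape (len(dfA), len(dfB))

    # Pick the pairing that maximises total similarity, each person used once.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return pd.DataFrame({
        "SetA ID": dfA.index.to_numpy()[rows],
        "SetB ID": dfB.index.to_numpy()[cols],
        "similarity": sim[rows, cols],
    })

# Toy data (illustrative values, 3 of the 35 columns, higher = more interested).
cols3 = ["interest1", "interest2", "interest3"]
dfA = pd.DataFrame([[5, 3, 0], [0, 4, 5], [2, 0, 4]], index=["A1", "A2", "A3"], columns=cols3)
dfB = pd.DataFrame([[0, 4, 5], [2, 0, 4], [5, 3, 0]], index=["B1", "B2", "B3"], columns=cols3)
matches = match_sets(dfA, dfB)
```

K-Means groups people into clusters and does not by itself produce pairs, which is why the sketch uses a pairwise similarity matrix plus an assignment step instead. If several Set A people may share the same Set B match, `sim.argmax(axis=1)` could replace the assignment step.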