I'm quite new to machine learning. I'm trying to match people from SetA with people from SetB based on their interest ratings (1=Low, 10=High). My real dataset has 40 features. Later I also want to give a higher weighting to certain features, as well as to interests that are less common, which I believe will help.

Example dataset:

import numpy as np
import pandas as pd

dfA = pd.DataFrame(np.array([[1, 1, 1], [4, 4, 4], [8, 8, 8]]),
                   columns=['interest1', 'interest2', 'interest3'],
                   index=['personA1', 'personA2', 'personA3'])

dfB = pd.DataFrame(np.array([[4, 4, 3], [2, 2, 1], [1, 2, 2]]),
                   columns=['interest1', 'interest2', 'interest3'],
                   index=['personB1', 'personB2', 'personB3'])

print(dfA, "\n", dfB)


          interest1  interest2  interest3
personA1          1          1          1
personA2          4          4          4
personA3          8          8          8 

          interest1  interest2  interest3
personB1          4          4          3
personB2          2          2          1
personB3          1          2          2

I'm using sklearn's nearest neighbors algorithm for this:

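from sklearn.neighbors import NearestNeighbors

# Fit on SetA, then query with SetB: each output row describes one person in SetB
knn = NearestNeighbors(n_neighbors=2).fit(dfA)

distances, indicies = knn.kneighbors(dfB)

print(distances, "\n \n", indicies)

[[1.         4.69041576]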
 [1.41421356 4.12310563]
 [1.41421356 4.12310563]] 

 [[1 0]
 [0 1]
 [0 1]]

I don't understand the output. I'm aware of a similar question's explanation, but I don't know how to apply it to this situation, as there are two different datasets.

Ultimately, I want a final dataframe for matches like:

SetA             SetB
personA1        personB2
personA2        personB1
personA3        personB3
chillingfox

1 Answer

For each person in SetB, the results give the nearest neighbours selected from the people in SetA.
In other words, the first element distances[0] holds the distances from personB1 to its two nearest neighbours in SetA, and indicies[0] holds the positions of those two people in SetA.

In this example:
indicies[0] = [1, 0] means that personB1's nearest neighbours in SetA are SetA[1] = personA2 and SetA[0] = personA1.
distances[0] = [1. 4.69041576] tells us that the distance between personB1 and personA2 is 1, and that the distance between personB1 and personA1 is 4.69041576 (you can easily check this if you compute the Euclidean distances by hand).
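
To check these two numbers by hand, here is a minimal sketch in plain Python. The helper name euclidean is just for illustration and is not part of sklearn; the rating vectors are copied from the question.

```python
import math

# Rating vectors copied from the example dataset in the question
person_b1 = [4, 4, 3]
person_a2 = [4, 4, 4]
person_a1 = [1, 1, 1]


def euclidean(p, q):
    """Euclidean distance between two equal-length rating vectors (illustrative helper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


dist_b1_a2 = euclidean(person_b1, person_a2)  # sqrt(0 + 0 + 1)
dist_b1_a1 = euclidean(person_b1, person_a1)  # sqrt(9 + 9 + 4) = sqrt(22)

print(dist_b1_a2, dist_b1_a1)  # 1.0 and roughly 4.6904
```

Both values agree with the first row of the distances array.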

A couple of remarks:

  • From the description of your problem, it seems that you are interested only in the nearest neighbour in SetA of each person in SetB (not the 2 nearest neighbours). If that is the case, I would suggest changing n_neighbors=2 to n_neighbors=1 in the knn parameters.

  • Be careful with your indices: in your dataset the labels start from 1 (personA1, personA2, ...), but in knn the indices always start from 0. This can lead to confusion when things get more complicated, since SetA[0]=personA1, so be mindful about it.
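
One way to sidestep the off-by-one confusion is to translate positions back into labels straight away. The sketch below is only an illustration: it hard-codes the index array printed in the question instead of calling sklearn, and the names labels_a, labels_b and nearest are made up for the example.

```python
# Labels in the same order as the rows of dfA and dfB in the question
labels_a = ["personA1", "personA2", "personA3"]
labels_b = ["personB1", "personB2", "personB3"]

# Index array printed in the question (n_neighbors=2); one row per person in SetB
indicies = [[1, 0], [0, 1], [0, 1]]

# The first entry of each row is the position of the nearest neighbour in SetA
nearest = {b: labels_a[row[0]] for b, row in zip(labels_b, indicies)}

print(nearest)
# {'personB1': 'personA2', 'personB2': 'personA1', 'personB3': 'personA1'}
```

With the real DataFrames, the same lookup should be possible with dfA.index[indicies[:, 0]], since indicies is a NumPy array there.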

ValleyCrisps
  • This is a very clear explanation, thank you! My next problem is that the closest neighbour for both B2 and B3 seems to be A1. Each person from SetA can only be assigned one unique person from SetB (so I can make the pairings table at the bottom of my question). How do I do this? – chillingfox Jun 12 '20 at 09:37
  • You are welcome! If the answer was helpful you may consider upvoting or accepting it, so that it can be useful to other people too. – ValleyCrisps Jun 12 '20 at 10:06
  • It is a priori possible that a person in SetA is more popular than another. For example, a person who likes football will tend to have closer neighbours than a person who is into stamp-collecting. KNN does not solve that problem for you, so you have to find a workaround. If you absolutely need a 1-1 correspondence, you should compute the distances to more than 1 or 2 neighbours and write some code that "eliminates" the persons in SetA as they find their companion in SetB. You may also consider writing your own custom KNN classifier from scratch for this purpose. – ValleyCrisps Jun 12 '20 at 10:16
  • You should see a check mark next to the answer, as explained here: https://stackoverflow.com/help/someone-answers Good luck with your project! – ValleyCrisps Jun 12 '20 at 11:11
  • Aha thanks for all your help! Last questions: if I have n_neighbors = 3 for example, will the first neighbour in indices always be the "top" (nearest) match? and then indices[1:] will be ordered by decreasing near-ness? – chillingfox Jun 12 '20 at 11:34
  • Yes, that is correct. The neighbours are listed in order from the closest to the furthest, so the indices array will show your best option, second best option, third best option (and so on) in this order. – ValleyCrisps Jun 12 '20 at 11:43
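
The "elimination" idea from the comments can be sketched roughly as follows. This is one possible greedy interpretation, not something sklearn does for you: it computes every SetA/SetB distance in plain Python (math.dist needs Python 3.8+), walks through the pairs from closest to furthest, and skips anyone who already has a partner. The names set_a, set_b, pairs, matches and taken_b are invented for the example.

```python
import math

# Example data from the question, keyed by person label
set_a = {"personA1": [1, 1, 1], "personA2": [4, 4, 4], "personA3": [8, 8, 8]}
set_b = {"personB1": [4, 4, 3], "personB2": [2, 2, 1], "personB3": [1, 2, 2]}

# Every (distance, A label, B label) combination, sorted closest first.
# Equal distances fall back to label order because tuples compare element-wise.
pairs = sorted(
    (math.dist(vec_a, vec_b), a, b)
    for a, vec_a in set_a.items()
    for b, vec_b in set_b.items()
)

matches = {}     # SetA label -> SetB label
taken_b = set()  # SetB labels that already have a partner

for _, a, b in pairs:
    # Accept a pair only if neither person has been matched yet
    if a not in matches and b not in taken_b:
        matches[a] = b
        taken_b.add(b)

for a in sorted(matches):
    print(a, matches[a])
```

Note that personA1 is equally far from personB2 and personB3 in this toy data, so which of the two it gets depends purely on how ties are broken. A greedy pass also does not guarantee the smallest total distance over all pairs; if that matters, an assignment solver such as scipy.optimize.linear_sum_assignment applied to the full distance matrix might be worth a look.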