Object Similarity Pandas and Scikit Learn

Question

Is there a way to find to find and rank rows in a Pandas Dataframe by their similarity to a row from another Dataframe?

Sample data? How do you want to define your similarity measure? — Alexander, Aug 09 '17 at 22:49
I'm thinking of similarity in terms of a distance measure. Something like Cosine or Euclidean distance. — christfan868, Aug 09 '17 at 23:04
`df.T.corrwith(df2.T)` somewhat along the lines of what you are looking for? — Uvar, Aug 09 '17 at 23:48

AChervony · Accepted Answer · 2017-08-12T02:13:14.007

My understanding of your question: you have two data frames, hopfully of the same column count. You want to rate first data frame's, the subject data frame, members by how close, i.e. similar, they are to any of the members of the target data frame.

I am not aware of a built in method. It is probably not the most efficient way but here is how I'd approach:

#! /usr/bin/python3

import pandas as pd
import numpy as np
import pprint
pp = pprint.PrettyPrinter(indent=4)

# Simulate data
df_subject = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')) # This is the one we're iterating to check similarity to target.
df_target = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')) # This is the one we're checking distance to

# This will hold the min dstances.
distances=[]
# Loop to iterate over subject DF
for ix1,subject in df_subject.iterrows():
    distances_cur=[]
    # Loop to iterate over target DF
    for ix2,target in df_target.iterrows():
        distances_cur.append(np.linalg.norm(target-subject))
    # Get the minimum distance for the subject set member.
    distances.append(min(distances_cur))

# Distances to df
distances=pd.DataFrame(distances)
# Normalize.
distances=0.5-(distances-distances.mean(axis=0))/distances.max(axis=0)

# Column index joining, ordering and beautification.
Proximity_Ratings_name='Proximity Ratings'
distances=distances.rename(columns={0: Proximity_Ratings_name})
df_subject=df_subject.join(distances)
pp.pprint(df_subject.sort_values(Proximity_Ratings_name,ascending=False))

It should yeild something like the table below. Higher rating means there's a similar member in the target data frame:

     A   B   C   D  Proximity Ratings
55  86  21  91  78           0.941537
38  91  31  35  95           0.901638
43  49  89  49   6           0.878030
98  28  98  98  36           0.813685
77  67  23  78  84           0.809324
35  52  16  36  58           0.802223
54   2  25  61  44           0.788591
95  76   3  60  46           0.766896
5   55  39  88  37           0.756049
52  79  71  90  70           0.752520
66  52  27  82  82           0.751353
41  45  67  55  33           0.739919
76  12  93  50  62           0.720323
94  99  84  39  63           0.716123
26  62   6  97  60           0.715081
40  64  50  37  27           0.714042
68  70  21   8  82           0.698824
47  90  54  60  65           0.676680
7   85  95  45  71           0.672036
2   14  68  50   6           0.661113
34  62  63  83  29           0.659322
8   87  90  28  74           0.647873
75  14  61  27  68           0.633370
60   9  91  42  40           0.630030
4   46  46  52  35           0.621792
81  94  19  82  44           0.614510
73  67  27  34  92           0.608137
30  92  64  93  51           0.608137
11  52  25  93  50           0.605770
51  17  48  57  52           0.604984
..  ..  ..  ..  ..                ...
64  28  56   0   9           0.397054
18  52  84  36  79           0.396518
99  41   5  32  34           0.388519
27  19  54  43  94           0.382714
92  69  56  73  93           0.382714
59   1  29  46  16           0.374878
58   2  36   8  96           0.362525
69  58  92  16  48           0.361505
31  27  57  80  35           0.349887
10  59  23  47  24           0.345891
96  41  77  76  33           0.345891
78  42  71  87  65           0.344398
93  12  31   6  27           0.329152
23   6   5  10  42           0.320445
14  44   6  43  29           0.319964
6   81  51  44  15           0.311840
3   17  60  13  22           0.293066
70  28  40  22  82           0.251549
36  95  72  35   5           0.249354
49  78  10  30  18           0.242370
17  79  69  57  96           0.225168
46  42  95  86  81           0.224742
84  58  81  59  86           0.221346
9    9  62   8  30           0.211659
72  11  51  74   8           0.159265
90  74  26  80   1           0.138993
20  90   4   6   5           0.117652
50   3  12   5  53           0.077088
42  90  76  42   1           0.075284
45  94  46  88  14           0.054244

Hope I understand correctly. Don't use if performance matters, I'm sure there's an algebraic way to approach this (Multiply matrices) that would run way faster.

That you so much AChervony, the *np.linalg.norm()* function was quite useful. It can also use be used as a function for calculating Euclidean distance apparently. I learnt even more by looking at this thread: https://stackoverflow.com/questions/32141856/is-norm-equivalent-to-euclidean-distance — christfan868, Aug 15 '17 at 15:04

Object Similarity Pandas and Scikit Learn

1 Answers1