7

Suppose I have two columns in a python pandas.DataFrame:

          col1 col2
item_1    158  173
item_2     25  191
item_3    180   33
item_4    152  165
item_5     96  108

What's the best way to take the cosine similarity of these two columns?

hlin117
  • For clarity, I presume that you mean: other than simply applying the formula, i.e., computing the magnitudes, normalizing, and doing the sum product. – Leo Sep 09 '14 at 04:52
  • @leo Yes, I mean what is the most optimized way. However, if there's a functional way that takes only a few lines, I'll be happy with that too. – hlin117 Sep 09 '14 at 05:18
  • Looks like there are relevant functions in [Scipy](http://docs.scipy.org/doc/scipy/reference/spatial.distance.html) – Marius Sep 09 '14 at 05:29
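
For reference, the formula Leo's comment describes (the sum product divided by the product of the magnitudes) can be written directly with NumPy. This is only a minimal sketch, assuming both columns have the same length and no missing values; the helper name `cosine_sim` is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [158, 25, 180, 152, 96],
                   "col2": [173, 191, 33, 165, 108]})


def cosine_sim(u, v):
    """Cosine similarity of two equal-length 1D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Sum product divided by the product of the Euclidean norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


similarity = cosine_sim(df["col1"], df["col2"])
print(similarity)  # approximately 0.7498
```

The library functions in the answers compute the same quantity; the hand-written version is mainly useful for seeing what they do.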

3 Answers

11

Is that what you're looking for?

from scipy.spatial.distance import cosine
from pandas import DataFrame


df = DataFrame({"col1": [158, 25, 180, 152, 96],
                "col2": [173, 191, 33, 165, 108]})

print(1 - cosine(df["col1"], df["col2"]))
xbello
  • One liners are always welcome, thanks! I think I'm focusing too much on finding functionality within python pandas itself, but not looking into the packages it integrates with, like scipy. – hlin117 Sep 09 '14 at 16:13
  • Note that if you have two different series with different indices, `NaN` values will be ignored by the cosine similarity computation, leading to an incorrect answer, as the norms in the denominator will be computed incorrectly (some values will be dropped to align with the other series) – Sergey Orshanskiy Sep 01 '16 at 00:41
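
Following the note above about mismatched indices, one way to avoid silently misaligned inputs is to check that both Series share the same index and then hand plain NumPy arrays to SciPy. This is a sketch, not part of the original answer: `safe_cosine_similarity` is a hypothetical name, and raising an error on mismatch is just one possible policy.

```python
import pandas as pd
from scipy.spatial.distance import cosine


def safe_cosine_similarity(a: pd.Series, b: pd.Series) -> float:
    """Cosine similarity that refuses Series with different indices (hypothetical helper)."""
    if not a.index.equals(b.index):
        raise ValueError("Series have different indices; align them explicitly first")
    # .to_numpy() strips the index, so SciPy sees plain positional arrays.
    return float(1.0 - cosine(a.to_numpy(), b.to_numpy()))


index = ["item_1", "item_2", "item_3", "item_4", "item_5"]
col1 = pd.Series([158, 25, 180, 152, 96], index=index)
col2 = pd.Series([173, 191, 33, 165, 108], index=index)
print(safe_cosine_similarity(col1, col2))  # approximately 0.7498

# A Series covering only some of the items is rejected instead of being
# silently combined with the longer one.
col3 = pd.Series([183, 204, 56], index=["item_1", "item_4", "item_5"])
try:
    safe_cosine_similarity(col2, col3)
except ValueError as exc:
    print(exc)
```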
8

You can also use cosine_similarity or other similarity metrics from sklearn.metrics.pairwise.

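from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity expects 2D input of shape (n_samples, n_features),
# so reshape each column into a single row vector first.
cosine_similarity(df["col1"].values.reshape(1, -1),
                  df["col2"].values.reshape(1, -1))
Out[4]: array([[0.7498213]])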
Amir Imani
  • I used `df['col1'].values.reshape(1, -1)` and `df['col2'].values.reshape(1, -1)` to get this to work. – Eric Ness Jul 08 '19 at 21:14
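
Because `cosine_similarity` compares the rows of a 2D array, another option is to transpose the frame so that each column becomes a row; the result is then the full column-by-column similarity matrix. A sketch, assuming the same data as in the question (the names `sim` and `sim_df` are just for illustration):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({"col1": [158, 25, 180, 152, 96],
                   "col2": [173, 191, 33, 165, 108]})

# After transposing, each row holds one of the original columns.
sim = cosine_similarity(df.T.to_numpy())

# Label the matrix with the column names for readability.
sim_df = pd.DataFrame(sim, index=df.columns, columns=df.columns)
print(sim_df)
print(sim_df.loc["col1", "col2"])  # approximately 0.7498
```

This scales to frames with more than two columns, where the off-diagonal entries give every pairwise similarity at once.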
1

My situation was a bit more complicated: the two columns I wanted to compare had different lengths (in other words, one of them contained NaN values). In that case the method in the accepted answer doesn't work as is (it outputs nan).

So I used the following little trick to deal with it. First, concatenate the two columns of interest into a new data frame. Then drop the rows that contain NaN. After that, the two columns contain only corresponding rows, and you can compare them with cosine distance or any other pairwise distance you wish.

import pandas as pd
from scipy.spatial import distance

index = ['item_1', 'item_2', 'item_3', 'item_4', 'item_5']
cols = [pd.Series([158, 25, 180, 152, 96], index=index, name='col1'),
        pd.Series([173, 191, 33, 165, 108], index=index, name='col2'),
        pd.Series([183, 204, 56], index=['item_1', 'item_4', 'item_5'], name='col3')]
df = pd.concat(cols, axis=1)
print(df)
print(distance.cosine(df['col2'], df['col3']))

Output:

        col1  col2   col3
item_1   158   173  183.0
item_2    25   191    NaN
item_3   180    33    NaN
item_4   152   165  204.0
item_5    96   108   56.0
nan

What you do is:

tdf = pd.concat([df['col2'], df['col3']], axis=1).dropna()
print(tdf)
print(distance.cosine(tdf['col2'], tdf['col3']))

Output is:

        col2   col3
item_1   173  183.0
item_4   165  204.0
item_5   108   56.0
0.02741129579408741
july_coder
  • But in this case, you are dropping values from col2. Thus, this gives an incorrect similarity score. – Kebby Mar 06 '23 at 22:13
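
To make the trade-off raised in the last comment concrete, here is a sketch that compares dropping incomplete rows with keeping every row and treating a missing value as 0. Whether 0 is a sensible stand-in depends entirely on what a missing value means in your data; that choice, and the helper name `cosine_sim`, are assumptions for illustration rather than anything from the answer.

```python
import numpy as np
import pandas as pd

index = ["item_1", "item_2", "item_3", "item_4", "item_5"]
col2 = pd.Series([173, 191, 33, 165, 108], index=index, name="col2")
col3 = pd.Series([183, 204, 56], index=["item_1", "item_4", "item_5"], name="col3")
pair = pd.concat([col2, col3], axis=1)


def cosine_sim(u, v):
    """Cosine similarity of two equal-length 1D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Option A: keep only the rows where both columns have a value.
dropped = pair.dropna()
sim_dropped = cosine_sim(dropped["col2"], dropped["col3"])

# Option B (assumption): a missing value means 0, so every row of col2 still counts.
filled = pair.fillna(0)
sim_filled = cosine_sim(filled["col2"], filled["col3"])

print(sim_dropped)  # 1 minus the cosine distance printed in the answer
print(sim_filled)   # lower, because item_2 and item_3 now contribute to col2's norm
```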