
I want to convert a collaborative filtering implementation in Python from Cosine Similarity to Adjusted Cosine Similarity.

The cosine-similarity-based implementation looks like this:

import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
from scipy.spatial.distance import pdist, squareform

data = pd.read_csv("C:\\Sample.csv")
data_germany = data.drop("Name", axis=1)  # drop the user-name column
data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)

# Item-item cosine similarity between every pair of fruit columns.
for i in range(len(data_ibs.columns)):
    for j in range(len(data_ibs.columns)):
        data_ibs.iloc[i, j] = 1 - cosine(data_germany.iloc[:, i], data_germany.iloc[:, j])

data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 6))

# For each fruit, keep the five most similar fruits (rank 1 is the fruit itself).
for i in range(len(data_ibs.columns)):
    data_neighbours.iloc[i, :] = data_ibs.iloc[:, i].sort_values(ascending=False)[:5].index

df = data_neighbours.head().iloc[:, 1:]  # ranks 2-5, skipping the fruit itself
print(df)

and the Sample.csv being used looks like this:

[screenshot: Sample.csv]

where 1 denotes that a user purchased a particular fruit, and conversely 0 denotes that a user didn't purchase it.
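Since the CSV itself isn't reproduced here, a hypothetical stand-in can be built inline (the fruit column names are taken from the results shown below; the user names and 0/1 values are made up for illustration):

```python
import pandas as pd

# Hypothetical purchase data standing in for Sample.csv:
# 1 = the user purchased the fruit, 0 = they didn't.
data = pd.DataFrame({
    "Name":   ["u1", "u2", "u3", "u4"],
    "Apple":  [1, 0, 1, 0],
    "Orange": [0, 1, 0, 1],
    "Pear":   [1, 0, 1, 1],
    "Grape":  [0, 1, 0, 0],
    "Melon":  [1, 1, 0, 1],
})
data_germany = data.drop("Name", axis=1)  # purchase matrix only
```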

When I run the code above this is what I get:

[screenshot: results1]

where rows are fruits and columns are similarity ranks (in decreasing order). In this example, Pear is the most similar to Apple, Melon is the second most similar, and so on.

I came across this post on Adjusted Cosine Similarity and tried to integrate that approach into my code. In this case the data are rating scores given by users to the fruits:

[screenshot: ratings]
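For context, the adjusted cosine idea is to subtract each user's mean rating before computing cosine similarity between item columns, which corrects for users who rate everything high or low. A minimal sketch on made-up ratings (the matrix values here are illustrative, not the data from the screenshot):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy ratings matrix: rows are users, columns are items.
M = np.array([[5.0, 3.0, 4.0],
              [4.0, 2.0, 5.0],
              [1.0, 5.0, 2.0]])

# Adjusted cosine: center each row on that user's mean rating,
# then take cosine similarity between the item columns.
M_u = M.mean(axis=1)            # per-user mean ratings
centered = M - M_u[:, None]     # broadcast the subtraction over columns
sim = 1 - squareform(pdist(centered.T, "cosine"))

print(np.round(sim, 2))
```

`pdist(..., "cosine")` returns cosine *distances* between columns of the transposed matrix, so subtracting from 1 recovers similarities, with 1.0 on the diagonal.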

Here's my attempt:

data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)
M_u = data_ibs.mean(axis=1)
M = np.asarray(data_ibs)
item_mean_subtracted = M - M_u[:, None]

for i in range(len(data_ibs.columns)):
    for j in range(len(data_ibs.columns)):
        data_ibs.iloc[i, j] = 1 - squareform(pdist(item_mean_subtracted.T, "cosine"))  ### error: "ValueError: setting an array element with a sequence."

data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 6))

for i in range(len(data_ibs.columns)):
    data_neighbours.iloc[i, :] = data_ibs.iloc[:, i].sort_values(ascending=False)[:5].index

df = data_neighbours.head().iloc[:, 1:]

But I'm stuck. My question is: how can Adjusted Cosine Similarity be successfully applied to this sample?

Tonechas
Mark K
    The question you're asking is too broad - there's a thousand factors that go into successfully applying a machine learning technique. You need to ask a more specific question, where people have a realistic chance of helping you. Also, what do you mean by "it stuck"? – Horia Coman Mar 20 '17 at 08:27
  • @Horia Coman, thank you for the comment. I mean to ask how to successfully apply Adjusted Cosine Similarity to this sample. A new line gives "ValueError: setting an array element with a sequence." – Mark K Mar 20 '17 at 08:33
  • Oh. I did not see the "### error" comment. It appears squareform produces a sequence of numbers, and you are trying to assign it to an array element, which is a single number. Perhaps try to step with the debugger to that line and see what sort of output it produces. Or add a bunch of print statements. – Horia Coman Mar 20 '17 at 08:35
  • @HoriaComan, thanks again. Are my added lines the right direction? I meant, what would be the right/other way to produce Adjusted Cosine Similarity for this sample? – Mark K Mar 20 '17 at 08:43

1 Answer

Here's a NumPy-based solution to your problem.

First we store rating data into an array:

fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])
M = np.asarray(data.loc[:, fruits])  # ratings matrix: rows are users, columns are fruits

Then we calculate the adjusted cosine similarity matrix:

M_u = M.mean(axis=1)                     # per-user mean ratings
item_mean_subtracted = M - M_u[:, None]  # broadcast the subtraction over columns
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))

And finally we sort the results in decreasing order of similarity:

# Rank the items in each row by decreasing similarity, dropping the item itself
# (argsort is ascending, so the self-similarity of 1.0 lands in the last column).
indices = np.fliplr(np.argsort(similarity_matrix, axis=1)[:, :-1])
result = np.hstack((fruits[:, None], fruits[indices]))  # prepend each fruit to its ranked neighbours

DEMO

In [49]: M
Out[49]: 
array([[ 0, 10,  0,  1,  0],
       [ 6,  0,  0,  0,  2],
       [ 1,  0, 20,  0,  1],
       [ 0,  3,  6,  0, 18],
       [ 3,  0,  2,  0,  0],
       [ 0,  2,  0,  5,  0]])

In [50]: np.set_printoptions(precision=2)

In [51]: similarity_matrix
Out[51]: 
array([[ 1.  ,  0.01, -0.41,  0.48, -0.44],
       [ 0.01,  1.  , -0.57,  0.37, -0.26],
       [-0.41, -0.57,  1.  , -0.56, -0.19],
       [ 0.48,  0.37, -0.56,  1.  , -0.51],
       [-0.44, -0.26, -0.19, -0.51,  1.  ]])

In [52]: result
Out[52]: 
array([['Apple', 'Grape', 'Orange', 'Pear', 'Melon'],
       ['Orange', 'Grape', 'Apple', 'Melon', 'Pear'],
       ['Pear', 'Melon', 'Apple', 'Grape', 'Orange'],
       ['Grape', 'Apple', 'Orange', 'Melon', 'Pear'],
       ['Melon', 'Pear', 'Orange', 'Apple', 'Grape']], 
      dtype='|S6')
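If you want the output back in the `data_neighbours`-style DataFrame from the question, one possible wrapper is sketched below, using the similarity matrix from the demo above (hard-coded here so the snippet stands alone):

```python
import numpy as np
import pandas as pd

fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])

# Adjusted cosine similarity matrix from the demo (symmetric, ones on the diagonal).
similarity_matrix = np.array([
    [ 1.00,  0.01, -0.41,  0.48, -0.44],
    [ 0.01,  1.00, -0.57,  0.37, -0.26],
    [-0.41, -0.57,  1.00, -0.56, -0.19],
    [ 0.48,  0.37, -0.56,  1.00, -0.51],
    [-0.44, -0.26, -0.19, -0.51,  1.00]])

# For each fruit, drop the self-similarity column and rank the rest descending.
indices = np.fliplr(np.argsort(similarity_matrix, axis=1)[:, :-1])
data_neighbours = pd.DataFrame(fruits[indices], index=fruits,
                               columns=range(1, len(fruits)))
print(data_neighbours)
```

Each row then lists a fruit's neighbours in decreasing order of adjusted cosine similarity, matching the `result` array shown in the demo.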
Tonechas
  • Thank you so much for the work and help! I've both upvoted and chosen it as the answer (sorry for the late reply, as there's a time difference :). – Mark K Mar 21 '17 at 02:05
  • but the M I am getting here is all "nan". – Mark K Mar 21 '17 at 03:25
  • You need to replace `M = np.asarray(data_ibs.loc[:, fruits])` by `M = np.asarray(data.loc[:, fruits])`. I edited my answer accordingly. – Tonechas Mar 21 '17 at 08:30