
I want to convert a collaborative filtering implementation in Python from Cosine Similarity to Adjusted Cosine Similarity.

The cosine-similarity-based implementation looks like this:

import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
from scipy.spatial.distance import pdist, squareform

data = pd.read_csv("C:\\Sample.csv")
data_germany = data.drop("Name", axis=1)  # drop the user-name column
data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)

# Item-item cosine similarity between every pair of fruit columns.
for i in range(len(data_ibs.columns)):
    for j in range(len(data_ibs.columns)):
        data_ibs.iloc[i, j] = 1 - cosine(data_germany.iloc[:, i], data_germany.iloc[:, j])

data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 6))

# For each fruit, keep the five most similar fruits (rank 1 is the fruit itself).
for i in range(len(data_ibs.columns)):
    data_neighbours.iloc[i, :] = data_ibs.iloc[:, i].sort_values(ascending=False)[:5].index

df = data_neighbours.head().iloc[:, 1:]  # ranks 2-5, skipping the fruit itself
print(df)

and the Sample.csv being used looks like this:

[screenshot: Sample.csv]

where 1 denotes that a user purchased a particular fruit, and conversely 0 denotes that a user didn't purchase it.
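Since the CSV itself isn't reproduced here, a hypothetical stand-in can be built inline (the fruit column names are taken from the results shown below; the user names and 0/1 values are made up for illustration):

```python
import pandas as pd

# Hypothetical purchase data standing in for Sample.csv:
# 1 = the user purchased the fruit, 0 = they didn't.
data = pd.DataFrame({
    "Name":   ["u1", "u2", "u3", "u4"],
    "Apple":  [1, 0, 1, 0],
    "Orange": [0, 1, 0, 1],
    "Pear":   [1, 0, 1, 1],
    "Grape":  [0, 1, 0, 0],
    "Melon":  [1, 1, 0, 1],
})
data_germany = data.drop("Name", axis=1)  # purchase matrix only
```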

When I run the code above this is what I get:

[screenshot: results1]

where rows are fruits and columns are similarity ranks (in decreasing order). In this example, Pear is the most similar to Apple, Melon is the second most similar, and so on.

I came across this post on Adjusted Cosine Similarity and tried to integrate that approach into my code. In this case the data are rating scores given by users to the fruits:

[screenshot: ratings]
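For context, the adjusted cosine idea is to subtract each user's mean rating before computing cosine similarity between item columns, which corrects for users who rate everything high or low. A minimal sketch on made-up ratings (the matrix values here are illustrative, not the data from the screenshot):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy ratings matrix: rows are users, columns are items.
M = np.array([[5.0, 3.0, 4.0],
              [4.0, 2.0, 5.0],
              [1.0, 5.0, 2.0]])

# Adjusted cosine: center each row on that user's mean rating,
# then take cosine similarity between the item columns.
M_u = M.mean(axis=1)            # per-user mean ratings
centered = M - M_u[:, None]     # broadcast the subtraction over columns
sim = 1 - squareform(pdist(centered.T, "cosine"))

print(np.round(sim, 2))
```

`pdist(..., "cosine")` returns cosine *distances* between columns of the transposed matrix, so subtracting from 1 recovers similarities, with 1.0 on the diagonal.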

Here's my attempt:

data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)
M_u = data_ibs.mean(axis=1)
M = np.asarray(data_ibs)
item_mean_subtracted = M - M_u[:, None]

for i in range(len(data_ibs.columns)):
    for j in range(len(data_ibs.columns)):
        data_ibs.iloc[i, j] = 1 - squareform(pdist(item_mean_subtracted.T, "cosine"))  ### error: "ValueError: setting an array element with a sequence."

data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 6))

for i in range(len(data_ibs.columns)):
    data_neighbours.iloc[i, :] = data_ibs.iloc[:, i].sort_values(ascending=False)[:5].index

df = data_neighbours.head().iloc[:, 1:]

But I'm stuck. My question is: how can Adjusted Cosine Similarity be successfully applied to this sample?

Tonechas
Mark K
    The question you're asking is too broad - there's a thousand factors that go into successfully applying a machine learning technique. You need to ask a more specific question, where people have a realistic chance of helping you. Also, what do you mean by "it stuck"? – Horia Coman Mar 20 '17 at 08:27
  • @Horia Coman, thank you for the comment. I mean to ask how to successfully apply Adjusted Cosine Similarity to this sample. A new line gives "ValueError: setting an array element with a sequence." – Mark K Mar 20 '17 at 08:33
  • Oh. I did not see the "### error" comment. It appears squareform produces a sequence of numbers, and you are trying to assign it to an array element, which is a single number. Perhaps try to step with the debugger to that line and see what sort of output it produces. Or add a bunch of print statements. – Horia Coman Mar 20 '17 at 08:35
  • @HoriaComan, thanks again. Are my added lines the right direction? I meant, what would be the right/other way to produce Adjusted Cosine Similarity for this sample? – Mark K Mar 20 '17 at 08:43

1 Answer

Here's a NumPy-based solution to your problem.

First we store rating data into an array:

fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])
M = np.asarray(data.loc[:, fruits])  # ratings matrix: rows are users, columns are fruits

Then we calculate the adjusted cosine similarity matrix:

M_u = M.mean(axis=1)                     # per-user mean ratings
item_mean_subtracted = M - M_u[:, None]  # broadcast the subtraction over columns
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))

And finally we sort the results in decreasing order of similarity:

# Rank the items in each row by decreasing similarity, dropping the item itself
# (argsort is ascending, so the self-similarity of 1.0 lands in the last column).
indices = np.fliplr(np.argsort(similarity_matrix, axis=1)[:, :-1])
result = np.hstack((fruits[:, None], fruits[indices]))  # prepend each fruit to its ranked neighbours

DEMO

In [49]: M
Out[49]: 
array([[ 0, 10,  0,  1,  0],
       [ 6,  0,  0,  0,  2],
       [ 1,  0, 20,  0,  1],
       [ 0,  3,  6,  0, 18],
       [ 3,  0,  2,  0,  0],
       [ 0,  2,  0,  5,  0]])

In [50]: np.set_printoptions(precision=2)

In [51]: similarity_matrix
Out[51]: 
array([[ 1.  ,  0.01, -0.41,  0.48, -0.44],
       [ 0.01,  1.  , -0.57,  0.37, -0.26],
       [-0.41, -0.57,  1.  , -0.56, -0.19],
       [ 0.48,  0.37, -0.56,  1.  , -0.51],
       [-0.44, -0.26, -0.19, -0.51,  1.  ]])

In [52]: result
Out[52]: 
array([['Apple', 'Grape', 'Orange', 'Pear', 'Melon'],
       ['Orange', 'Grape', 'Apple', 'Melon', 'Pear'],
       ['Pear', 'Melon', 'Apple', 'Grape', 'Orange'],
       ['Grape', 'Apple', 'Orange', 'Melon', 'Pear'],
       ['Melon', 'Pear', 'Orange', 'Apple', 'Grape']], 
      dtype='|S6')
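If you want the output back in the `data_neighbours`-style DataFrame from the question, one possible wrapper is sketched below, using the similarity matrix from the demo above (hard-coded here so the snippet stands alone):

```python
import numpy as np
import pandas as pd

fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])

# Adjusted cosine similarity matrix from the demo (symmetric, ones on the diagonal).
similarity_matrix = np.array([
    [ 1.00,  0.01, -0.41,  0.48, -0.44],
    [ 0.01,  1.00, -0.57,  0.37, -0.26],
    [-0.41, -0.57,  1.00, -0.56, -0.19],
    [ 0.48,  0.37, -0.56,  1.00, -0.51],
    [-0.44, -0.26, -0.19, -0.51,  1.00]])

# For each fruit, drop the self-similarity column and rank the rest descending.
indices = np.fliplr(np.argsort(similarity_matrix, axis=1)[:, :-1])
data_neighbours = pd.DataFrame(fruits[indices], index=fruits,
                               columns=range(1, len(fruits)))
print(data_neighbours)
```

Each row then lists a fruit's neighbours in decreasing order of adjusted cosine similarity, matching the `result` array shown in the demo.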
Tonechas
  • Thank you so much for the work and help! I've both upvoted and chosen it as the answer (sorry for the late reply, as there's a time difference :). – Mark K Mar 21 '17 at 02:05
  • but the M I am getting here is all "nan". – Mark K Mar 21 '17 at 03:25
  • You need to replace `M = np.asarray(data_ibs.loc[:, fruits])` by `M = np.asarray(data.loc[:, fruits])`. I edited my answer accordingly. – Tonechas Mar 21 '17 at 08:30