Cosine similarity between a vector and a pandas column (a linear vector)

Question

I have a pandas DataFrame containing a list of wines with their respective attributes.

Then I added a new column, vector, that holds a NumPy vector built from these attributes for each wine.

def get_wine_profile(id):
    wine = wines[wines['exclusiviId'] == id]
    wine_vector = np.array(wine[wine_attrs].values.tolist()).flatten()

    return wine_vector

wines['vector'] = wines.exclusiviId.apply(get_wine_profile)

So the vector column looks something like this:

vector

[1, 1, 1, 2, 2, 2, 2, 1, 1, 1]

[3, 1, 2, 1, 2, 2, 2, 0, 1, 3]

[1, 1, 2, 1, 3, 3, 3, 0, 1, 1]

.

.

Now I want to compute the cosine similarity between this column and another vector, which is built from the user input. This is what I have tried so far:

from scipy.spatial.distance import cosine
cos_vec = wines.apply(lambda x: 1 - cosine(wines["vector"], [1, 1, 1, 2, 2, 2, 2, 1, 1, 1]), axis=1)
print(cos_vec)

This throws the following error:

ValueError: ('operands could not be broadcast together with shapes (63,) (10,) ', 'occurred at index 0')

I also tried using sklearn, but it has the same problem with the array shape.

What I want as the final output is a column holding the match score between each row of this column and the user input.

score 0 · Answer 1 · answered Jun 05 '18 at 11:03

A better solution IMO is to use cdist with cosine metric. You are effectively computing pairwise distances between n points in your DataFrame and 1 point in your user input, i.e. n pairs in total.

If you handle more than one user at a time, this would be even more efficient.

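import numpy as np
from scipy.spatial.distance import cdist

# make into 1x10 array
user_input = np.array([1, 1, 1, 2, 2, 2, 2, 1, 1, 1])[None]
# cdist returns an (n, 1) array of cosine distances; ravel() gives one value per row
df["cos_dist"] = cdist(np.stack(df.vector), user_input, metric="cosine").ravel()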


#                            vector  cos_dist
# 0  [1, 1, 1, 2, 2, 2, 2, 1, 1, 1]   0.00000
# 1  [3, 1, 2, 1, 2, 2, 2, 0, 1, 3]   0.15880
# 2  [1, 1, 2, 1, 3, 3, 3, 0, 1, 1]   0.07613
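
Note that cdist with the cosine metric returns a distance, while the question asks for a match score; the similarity is 1 minus the distance. A minimal sketch of that conversion, which also covers the multi-user case mentioned above. The second user's input and the score column names are made up for illustration, and the three wine vectors are the sample rows from the question:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Sample rows copied from the question; the real frame has more wines.
df = pd.DataFrame({
    "vector": [
        np.array([1, 1, 1, 2, 2, 2, 2, 1, 1, 1]),
        np.array([3, 1, 2, 1, 2, 2, 2, 0, 1, 3]),
        np.array([1, 1, 2, 1, 3, 3, 3, 0, 1, 1]),
    ]
})

# Hypothetical inputs for two users, one row per user (shape 2x10).
user_inputs = np.array([
    [1, 1, 1, 2, 2, 2, 2, 1, 1, 1],
    [3, 1, 2, 1, 2, 2, 2, 0, 1, 3],
])

# cdist gives distances with shape (n_wines, n_users); similarity = 1 - distance.
similarity = 1.0 - cdist(np.stack(df["vector"]), user_inputs, metric="cosine")

# One score column per user (column names are illustrative only).
scores = pd.DataFrame(
    similarity, index=df.index, columns=["score_user_0", "score_user_1"]
)
print(scores.round(5))
```

The first score column should equal 1 minus the cos_dist values listed above, since the first user input is the same vector.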

By the way, it looks like you are using native Python lists. I would switch everything to numpy arrays. A conversion to np.array is happening under the hood anyway when you call cosine.
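
On the error itself: the shapes (63,) and (10,) suggest that the lambda handed the whole wines["vector"] column to cosine instead of a single row's vector. A minimal sketch of a row-wise version, assuming the sample vectors from the question; user_vector is a name chosen here for illustration:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

# Sample rows copied from the question.
wines = pd.DataFrame({
    "vector": [
        np.array([1, 1, 1, 2, 2, 2, 2, 1, 1, 1]),
        np.array([3, 1, 2, 1, 2, 2, 2, 0, 1, 3]),
        np.array([1, 1, 2, 1, 3, 3, 3, 0, 1, 1]),
    ]
})

# Illustrative name for the vector derived from the user input.
user_vector = np.array([1, 1, 1, 2, 2, 2, 2, 1, 1, 1])

# Apply on the column itself so each call receives one 10-element vector,
# not the whole column; cosine() returns a distance, so subtract from 1.
wines["score"] = wines["vector"].apply(lambda v: 1 - cosine(v, user_vector))
print(wines)
```

This calls cosine once per row, so for a large frame the cdist version above is likely to be faster.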

Thanks for the answer, @yakym, but I had already tried a different approach, which worked. - Atlancey India, Jun 06 '18 at 11:36

score 0 · Answer 2 · answered Jun 06 '18 at 11:35

I wrote my own function to do this, and it works:

import math
def cosine_similarity(v1, v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/(||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

def get_similarity(id):
    # result_vector is the vector built from the user input (defined elsewhere)
    vec1 = result_vector
    vec2 = get_wine_profile(id)
    similarity = cosine_similarity(vec1, vec2)
    return similarity

wines['score'] = wines.exclusiviId.apply(get_similarity)
display(wines.head())
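
The same formula can also be written without the Python loop. A minimal sketch using NumPy, assuming all wine vectors have the same length as the user vector; cosine_similarity_rows is a hypothetical helper name, and the sample vectors are the ones from the question:

```python
import numpy as np

def cosine_similarity_rows(vectors, query):
    """Cosine similarity of each row of `vectors` to `query`.

    Hypothetical helper: (row dot query) / (||row|| * ||query||).
    A zero-length row or query leads to division by zero (nan),
    just as the loop version would fail on a zero vector.
    """
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    dots = vectors @ query  # one dot product per row
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    return dots / norms

# Sample rows copied from the question.
wine_vectors = np.array([
    [1, 1, 1, 2, 2, 2, 2, 1, 1, 1],
    [3, 1, 2, 1, 2, 2, 2, 0, 1, 3],
    [1, 1, 2, 1, 3, 3, 3, 0, 1, 1],
])
result_vector = np.array([1, 1, 1, 2, 2, 2, 2, 1, 1, 1])

scores = cosine_similarity_rows(wine_vectors, result_vector)
print(scores.round(5))
```

With a vector column like the one in the question, np.stack(wines["vector"]) would produce the 2-D array to pass as the first argument.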
