
I'm trying to implement cosine similarity for two vectors, but I ran into a special case where the two vectors only have one component, like this:

v1 = [3] 
v2 = [4]

Here is my implementation for the cosine similarity:

import math

def dotProduct(v1, v2):
    if len(v1) != len(v2):
        return 0
    return sum([x * y for x, y in zip(v1, v2)])

def cosineSim(v1, v2):
    dp = dotProduct(v1, v2)
    mag1 = math.sqrt(dotProduct(v1, v1))
    mag2 = math.sqrt(dotProduct(v2, v2))
    return dp / (mag1 * mag2)

With this implementation, the cosine similarity of any two single-component vectors is always 1: for example, cosineSim([3], [4]) = 12 / (3 · 4) = 1. Can someone guide me through how to handle this special case? Thank you.

efsee
  • Why not use numpy? – cs95 Mar 09 '18 at 18:54
  • @cᴏʟᴅsᴘᴇᴇᴅ can you be more specific? I don't really understand how to use numpy in this case, thank you. – efsee Mar 09 '18 at 18:57
  • 1
    `np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))` – cs95 Mar 09 '18 at 18:58
  • @cᴏʟᴅsᴘᴇᴇᴅ But the results are still all 1 – efsee Mar 09 '18 at 19:00
  • Obviously. The cosine similarity between two scalars (length 1 vectors) is 1. – cs95 Mar 09 '18 at 19:00
  • @cᴏʟᴅsᴘᴇᴇᴅ I'm trying to implement cosine similarity to rank documents with tf-idf score, in this case, with only one word, how should I rank them? – efsee Mar 09 '18 at 19:04
  • 2
    @efsee um, the dimensions of your document vectors shouldn't be changing. – juanpa.arrivillaga Mar 09 '18 at 19:05
  • @efsee if you're doing tf-idf on documents, the vector for each document ought to be its one-hot encoding, not a raw list of the word indices. That way, you can compare documents of different lengths, and documents of length 1 have well-defined behavior, specifically that they have similarity of 1 to identical documents and 0 to all others. – scnerd Mar 09 '18 at 19:07
  • 1
    @scnerd Not one hot, but fixed length vectors at least. – cs95 Mar 09 '18 at 19:08
  • @cᴏʟᴅsᴘᴇᴇᴅ I stand corrected. One-hot is an easy solution if you already have word indices, otherwise you could look into a document embedding algorithm of some sort. Regardless, cosine similarity is only defined for vectors of equal dimensionality, so you really need all your document vectors to have the same length if you're going to rank them. – scnerd Mar 09 '18 at 19:11
  • This is a question about (applied) math, not programming. – Davis Herring Mar 09 '18 at 19:31
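
To make the point in the comments above concrete, here is a minimal sketch (the three-word vocabulary and the example documents are invented for illustration) in which every document is embedded into a fixed-dimension bag-of-words vector, so a one-word query can still be ranked against longer documents:

import math

# Hypothetical shared vocabulary: one dimension per word, so every
# document vector has the same length regardless of document length.
vocab = ["cat", "dog", "fish"]

def bagOfWords(doc):
    # Count how often each vocabulary word occurs in the document.
    return [doc.count(word) for word in vocab]

def cosineSim(v1, v2):
    dp = sum(x * y for x, y in zip(v1, v2))
    mag1 = math.sqrt(sum(x * x for x in v1))
    mag2 = math.sqrt(sum(x * x for x in v2))
    return dp / (mag1 * mag2)

query = bagOfWords(["cat"])               # one-word query -> [1, 0, 0]
doc1 = bagOfWords(["cat", "cat", "dog"])  # [2, 1, 0]
doc2 = bagOfWords(["fish"])               # [0, 0, 1]

print(cosineSim(query, doc1))  # ~0.894: shares a word with the query
print(cosineSim(query, doc2))  # 0.0: no overlap with the query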

2 Answers


The correct answer here is to use numpy. As @COLDSPEED said, convert your lists to numpy vectors and use those to perform the operation. The most succinct way to do this is with scipy's cosine distance function:

from scipy.spatial.distance import cosine

cosine_similarity = 1 - cosine(v1, v2)
# Or...
cosine_distance = cosine(v1, v2)
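
For the single-component vectors from the question, this evaluates as you would expect, including the sign:

from scipy.spatial.distance import cosine

print(1 - cosine([3], [4]))   # 1.0  (same sign)
print(1 - cosine([3], [-4]))  # -1.0 (opposite signs)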

Or using raw numpy arrays, you can do it yourself:

import numpy as np

v1 = np.array(v1)
v2 = np.array(v2)
cosine_similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
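
Note that this version does not guard against a zero vector: if every component of v1 or v2 is zero, the denominator is zero and the division yields nan (with a numpy runtime warning) rather than a meaningful similarity, so check the norms first if that can occur in your data.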

If you must reinvent the wheel for some reason, your solution would probably be another if case:

def dotProduct(v1, v2):
    if len(v1) != len(v2):
        return 0
    if len(v1) == 1:  # Only need to check one; the lengths match here
        # Return the sign of the product, so cosineSim yields 1 for
        # same-sign components and -1 for opposite signs.
        return 1 if v1[0] * v2[0] > 0 else -1
    return sum(x * y for x, y in zip(v1, v2))
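
Be aware that patching dotProduct like this changes its meaning for every caller, including the self-products used for the magnitudes in cosineSim; handling the special case inside cosineSim instead, or better, using fixed-dimension vectors as discussed in the comments above, avoids that trap.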
scnerd
  • My problem is that now I have a query of one word, and I want to rank the result documents by cosine similarity; the dimension of my vectors is the number of terms in the query. So I have a query vector with one dimension and a bunch of document vectors with one dimension. – efsee Mar 09 '18 at 19:24
  • @efsee "the dimension of my vectors is the number of terms in my query" You are approaching this completely incorrectly. Presumably, you are attempting some form of a bag-of-words / vector-space model for your documents. Your documents should *always have the same dimensions*. As you are discovering, a vector-space model with only one dimension is not very useful, especially when it comes to ranking by cosine similarity... – juanpa.arrivillaga Mar 09 '18 at 19:26
  • @efsee Ok, I accept that this answer doesn't solve the actual problem you're trying to solve, just the one you asked. See the comments above for why you shouldn't have any difference in your vectors' dimensionalities in the first place. In order to use cosine similarity (or any distance metric I've traditionally seen used), you need to embed everything, both documents and queries, into a constant dimension. Look into one-hot encodings and document embeddings for how to do this. – scnerd Mar 09 '18 at 19:27

Try this snippet for the single-component case, where a and b are the lone components of the two vectors:

def cosineSim(v1, v2):
    # For single-component vectors, the cosine similarity reduces to
    # the sign of the product of the two components.
    a, b = v1[0], v2[0]
    if a * b == 0:
        return 0
    if a * b < 0:
        return -1
    return 1
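
Since the magnitudes in the denominator of the full formula are always positive, this sign is exactly what the full computation would return for single-component vectors.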
vishfrnds