
I'm trying to implement cosine similarity for two vectors, but I ran into a special case where the two vectors only have one component, like this:

v1 = [3] 
v2 = [4]

Here is my implementation for the cosine similarity:

import math

def dotProduct(v1, v2):
    if len(v1) != len(v2):
        return 0
    return sum([x * y for x, y in zip(v1, v2)])

def cosineSim(v1, v2):
    dp = dotProduct(v1, v2)
    mag1 = math.sqrt(dotProduct(v1, v1))
    mag2 = math.sqrt(dotProduct(v2, v2))
    return dp / (mag1 * mag2)

With this implementation, the cosine similarity of any two single-component vectors is always 1: for example, cosineSim([3], [4]) = 12 / (3 · 4) = 1. Can someone guide me through how to handle this special case? Thank you.

efsee
  • Why not use numpy? – cs95 Mar 09 '18 at 18:54
  • @cᴏʟᴅsᴘᴇᴇᴅ can you be more specific? I don't really understand how to use numpy in this case, thank you. – efsee Mar 09 '18 at 18:57
  • 1
    `np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))` – cs95 Mar 09 '18 at 18:58
  • @cᴏʟᴅsᴘᴇᴇᴅ But the results are still all 1 – efsee Mar 09 '18 at 19:00
  • Obviously. The cosine similarity between two scalars (length 1 vectors) is 1. – cs95 Mar 09 '18 at 19:00
  • @cᴏʟᴅsᴘᴇᴇᴅ I'm trying to implement cosine similarity to rank documents with tf-idf score, in this case, with only one word, how should I rank them? – efsee Mar 09 '18 at 19:04
  • 2
    @efsee um, the dimensions of your document vectors shouldn't be changing. – juanpa.arrivillaga Mar 09 '18 at 19:05
  • @efsee if you're doing tf-idf on documents, the vector for each document ought to be its one-hot encoding, not a raw list of the word indices. That way, you can compare documents of different lengths, and documents of length 1 have well-defined behavior, specifically that they have similarity of 1 to identical documents and 0 to all others. – scnerd Mar 09 '18 at 19:07
  • 1
    @scnerd Not one hot, but fixed length vectors at least. – cs95 Mar 09 '18 at 19:08
  • @cᴏʟᴅsᴘᴇᴇᴅ I stand corrected. One-hot is an easy solution if you already have word indices, otherwise you could look into a document embedding algorithm of some sort. Regardless, cosine similarity is only defined for vectors of equal dimensionality, so you really need all your document vectors to have the same length if you're going to rank them. – scnerd Mar 09 '18 at 19:11
  • This is a question about (applied) math, not programming. – Davis Herring Mar 09 '18 at 19:31
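
To make the point in the comments above concrete, here is a minimal sketch (the three-word vocabulary and the example documents are invented for illustration) in which every document is embedded into a fixed-dimension bag-of-words vector, so a one-word query can still be ranked against longer documents:

import math

# Hypothetical shared vocabulary: one dimension per word, so every
# document vector has the same length regardless of document length.
vocab = ["cat", "dog", "fish"]

def bagOfWords(doc):
    # Count how often each vocabulary word occurs in the document.
    return [doc.count(word) for word in vocab]

def cosineSim(v1, v2):
    dp = sum(x * y for x, y in zip(v1, v2))
    mag1 = math.sqrt(sum(x * x for x in v1))
    mag2 = math.sqrt(sum(x * x for x in v2))
    return dp / (mag1 * mag2)

query = bagOfWords(["cat"])               # one-word query -> [1, 0, 0]
doc1 = bagOfWords(["cat", "cat", "dog"])  # [2, 1, 0]
doc2 = bagOfWords(["fish"])               # [0, 0, 1]

print(cosineSim(query, doc1))  # ~0.894: shares a word with the query
print(cosineSim(query, doc2))  # 0.0: no overlap with the query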

2 Answers


The correct answer here is to use numpy. As @COLDSPEED said, convert your lists to numpy vectors and use those to perform the operation. The most succinct way to do this is with scipy's cosine distance function:

from scipy.spatial.distance import cosine

cosine_similarity = 1 - cosine(v1, v2)
# Or...
cosine_distance = cosine(v1, v2)
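
For the single-component vectors from the question, this evaluates as you would expect, including the sign:

from scipy.spatial.distance import cosine

print(1 - cosine([3], [4]))   # 1.0  (same sign)
print(1 - cosine([3], [-4]))  # -1.0 (opposite signs)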

Or using raw numpy arrays, you can do it yourself:

import numpy as np

v1 = np.array(v1)
v2 = np.array(v2)
cosine_similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
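
Note that this version does not guard against a zero vector: if every component of v1 or v2 is zero, the denominator is zero and the division yields nan (with a numpy runtime warning) rather than a meaningful similarity, so check the norms first if that can occur in your data.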

If you must reinvent the wheel for some reason, your solution would probably be another if case:

def dotProduct(v1, v2):
    if len(v1) != len(v2):
        return 0
    if len(v1) == 1:  # Only need to check one; the lengths match here
        # Return the sign of the product, so cosineSim yields 1 for
        # same-sign components and -1 for opposite signs.
        return 1 if v1[0] * v2[0] > 0 else -1
    return sum(x * y for x, y in zip(v1, v2))
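
Be aware that patching dotProduct like this changes its meaning for every caller, including the self-products used for the magnitudes in cosineSim; handling the special case inside cosineSim instead, or better, using fixed-dimension vectors as discussed in the comments above, avoids that trap.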
scnerd
  • My problem is that now I have a query of one word, and I want to rank the result documents by cosine similarity; the dimension of my vectors is the number of terms in the query. So I have a query vector with one dimension and a bunch of document vectors with one dimension. – efsee Mar 09 '18 at 19:24
  • @efsee "the dimension of my vectors is the number of terms in my query" You are approaching this completely incorrectly. Presumably, you are attempting some form of a bag-of-words / vector-space model for your documents. Your documents should *always have the same dimensions*. As you are discovering, a vector-space model with only one dimension is not very useful, especially when it comes to ranking by cosine similarity... – juanpa.arrivillaga Mar 09 '18 at 19:26
  • @efsee Ok, I accept that this answer doesn't solve the actual problem you're trying to solve, just the one you asked. See the comments above for why you shouldn't have any difference in your vectors' dimensionalities in the first place. In order to use cosine similarity (or any distance metric I've traditionally seen used), you need to embed everything, both documents and queries, into a constant dimension. Look into one-hot encodings and document embeddings for how to do this. – scnerd Mar 09 '18 at 19:27

Try this snippet for the single-component case, where a and b are the lone components of the two vectors:

def cosineSim(v1, v2):
    # For single-component vectors, the cosine similarity reduces to
    # the sign of the product of the two components.
    a, b = v1[0], v2[0]
    if a * b == 0:
        return 0
    if a * b < 0:
        return -1
    return 1
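
Since the magnitudes in the denominator of the full formula are always positive, this sign is exactly what the full computation would return for single-component vectors.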
vishfrnds