Can someone explain these two formulas? Do they have any relationship?

def _cosine_distance(a, b, data_is_normalized=False):
    if not data_is_normalized:
        a = np.asarray(a) / np.linalg.norm(a, axis=1, keepdims=True)
        b = np.asarray(b) / np.linalg.norm(b, axis=1, keepdims=True)
    return 1. - np.dot(a, b.T)

def findCosineSimilarity(source_representation, test_representation):
    a = np.matmul(np.transpose(source_representation), test_representation)
    b = np.sum(np.multiply(source_representation, source_representation))
    c = np.sum(np.multiply(test_representation, test_representation))
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))
desertnaut
  • Have you tested if they calculate the same output for the same input? My guess is that it's the same function ([cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity), `1-S_c(A,B)`), but the top approach first calculates the individual fractions, while the bottom approach calculates the fraction last. By the way, as I see it, none of the equations above describe cosine **similarity**, but both describe the cosine **distance**. – André Apr 25 '22 at 14:17
  • findCosineSimilarity works with 1D arrays; the other one does not. Also, given two arrays of shape (2,5) as input, _cosine_distance returned a (2,2) matrix and findCosineSimilarity a (5,5) matrix. I think there should be a relation. Also, I cannot understand the np.linalg.norm() function – Emil Seyfullayev Apr 25 '22 at 16:10

1 Answer

Regarding your comment: computing the cosine distance between two matrices of shape 2 x 5 means finding the pairwise cosine distances between the vectors of the two arrays. Assuming you are working with row vectors (the usual NumPy convention), the output should have 2 * 2 = 4 elements. If you are working with column vectors, then 5 * 5 = 25 elements makes sense.
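To make the shape argument concrete, here is a minimal sketch. The helper name pairwise_cosine_distance is my own (it mirrors _cosine_distance without the data_is_normalized flag), and the random inputs are only placeholders.

```python
import numpy as np

def pairwise_cosine_distance(a, b):
    # Hypothetical helper: same idea as _cosine_distance, rows are the vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5))
y = rng.normal(size=(2, 5))

# Row-vector reading: 2 vectors of length 5 in each array.
rows = pairwise_cosine_distance(x, y)
# Column-vector reading: 5 vectors of length 2 in each array.
cols = pairwise_cosine_distance(x.T, y.T)

print(rows.shape)  # (2, 2)
print(cols.shape)  # (5, 5)
```

So the (2,2) versus (5,5) shapes alone only tell you which convention each function assumes, not whether the values are right.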

_cosine_distance looks good

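The function _cosine_distance is correctly named and correctly implemented for the general case where a is in R^{n x l} and b is in R^{m x l}, i.e. n and m nonzero row vectors of length l. The result is an n x m matrix of pairwise distances.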
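Since np.linalg.norm came up in the comments, here is a small sketch of what the normalization step inside _cosine_distance does. The example matrix is made up for illustration.

```python
import numpy as np

a = np.array([[3.0, 4.0],
              [6.0, 8.0]])

# axis=1 computes one Euclidean length per row;
# keepdims=True keeps the result as a column of shape (2, 1)
# so that it broadcasts against the (2, 2) matrix below.
norms = np.linalg.norm(a, axis=1, keepdims=True)
print(norms.shape)  # (2, 1)

# Dividing scales every row to unit length (here both rows become [0.6, 0.8]).
unit = a / norms
print(np.linalg.norm(unit, axis=1))
```

Once every row has length 1, the dot product of two rows is exactly their cosine similarity, which is why the function can finish with 1 - np.dot(a, b.T).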

To use _cosine_distance for 1D arrays you can simply add a singleton dimension at axis 0, e.g. _cosine_distance(a[np.newaxis], b[np.newaxis]).
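As a quick check of how the two functions relate, the sketch below runs both on the same pair of 1D vectors. The functions are copied from the question; the vectors u and v are arbitrary values I picked.

```python
import numpy as np

def _cosine_distance(a, b, data_is_normalized=False):
    if not data_is_normalized:
        a = np.asarray(a) / np.linalg.norm(a, axis=1, keepdims=True)
        b = np.asarray(b) / np.linalg.norm(b, axis=1, keepdims=True)
    return 1. - np.dot(a, b.T)

def findCosineSimilarity(source_representation, test_representation):
    a = np.matmul(np.transpose(source_representation), test_representation)
    b = np.sum(np.multiply(source_representation, source_representation))
    c = np.sum(np.multiply(test_representation, test_representation))
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# _cosine_distance needs 2D input, so add a leading axis: shape (1, 3).
d_pairwise = _cosine_distance(u[np.newaxis], v[np.newaxis])  # shape (1, 1)
# findCosineSimilarity accepts the 1D vectors directly and returns a scalar.
d_scalar = findCosineSimilarity(u, v)

print(d_pairwise.shape)                       # (1, 1)
print(np.isclose(d_pairwise[0, 0], d_scalar)) # True
```

For a single pair of vectors the two are the same formula: one normalizes first and then takes the dot product, the other takes the dot product first and then divides by the product of the norms.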

findCosineSimilarity looks bad

findCosineSimilarity is incorrectly named (it calculates the cosine distance), and the implementation only works for one-dimensional arrays. For anything other than 1D arrays it computes something that does not match the definition of cosine distance, because b and c sum over the whole matrix instead of over each vector. Also, transposing source_representation (the left matrix) hints that the function is meant for column vectors, which differs from _cosine_distance, although findCosineSimilarity would not work for matrices either way.

It is easy to create a test case that does not depend on the row/column vector convention by using an n x n matrix:

1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

If we calculate the pairwise cosine distance for every vector in the matrix we should get all zeros as the vectors are the same.

import numpy as np

def findCosineSimilarity(source_representation, test_representation):
    a = np.matmul(np.transpose(source_representation), test_representation)
    b = np.sum(np.multiply(source_representation, source_representation))
    c = np.sum(np.multiply(test_representation, test_representation))
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

def _cosine_distance(a, b, data_is_normalized=False):
    if not data_is_normalized:
        a = np.asarray(a) / np.linalg.norm(a, axis=1, keepdims=True)
        b = np.asarray(b) / np.linalg.norm(b, axis=1, keepdims=True)
    return 1. - np.dot(a, b.T)

a = np.array([
    [1,1,1,1],
    [1,1,1,1],
    [1,1,1,1],
    [1,1,1,1]
])

print(findCosineSimilarity(a,a))
print(_cosine_distance(a, a))

Output:

[[0.75 0.75 0.75 0.75]
 [0.75 0.75 0.75 0.75]
 [0.75 0.75 0.75 0.75]
 [0.75 0.75 0.75 0.75]]
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

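We see that findCosineSimilarity fails: it returns 0.75 = 1 - 4/16 because it divides each dot product (4) by the squared norm of the whole matrix (16) rather than by the product of the two vector norms (2 * 2 = 4). _cosine_distance gives the expected zeros.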
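If column vectors really are what you want, one possible fix is to normalize per column instead of over the whole matrix. This is only a sketch under that assumption: the name cosine_distance_columns is hypothetical, and it does not guard against zero-length columns.

```python
import numpy as np

def cosine_distance_columns(source, test):
    # Hypothetical variant: every COLUMN of each input is treated as one vector.
    source = np.asarray(source, dtype=float)
    test = np.asarray(test, dtype=float)
    # One norm per column, shapes (1, n) and (1, m).
    source_norms = np.linalg.norm(source, axis=0, keepdims=True)
    test_norms = np.linalg.norm(test, axis=0, keepdims=True)
    # Pairwise dot products between columns, shape (n, m).
    dots = source.T @ test
    # Outer product of the norms gives the matching (n, m) denominators.
    return 1.0 - dots / (source_norms.T * test_norms)

a = np.ones((4, 4))
result = cosine_distance_columns(a, a)
print(np.allclose(result, 0.0))  # True
```

This should give the same values as calling _cosine_distance on the transposed inputs, so in practice _cosine_distance(a.T, b.T) is probably the simpler option.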

Naphat Amundsen