Regarding your comment: the cosine distance of two matrices of shape 2 x 5 consists of the pairwise cosine distances between the vectors in each array. Assuming you are working with row vectors (which you should when you use NumPy conventionally), the expected output consists of 2 * 2 = 4 elements. If you are working with column vectors, then 5 * 5 = 25 elements makes sense.
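To make the shape argument concrete, here is a minimal sketch. The helper name pairwise_cosine_distance and the random inputs p and q are made up for illustration, and the sketch assumes that no vector is all zeros.

```python
import numpy as np

def pairwise_cosine_distance(x, y):
    """Cosine distance between every row of x and every row of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Scale every row to unit length, then one matrix product gives all cosines.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

rng = np.random.default_rng(0)
p = rng.random((2, 5)) + 0.1  # offset keeps every vector away from zero
q = rng.random((2, 5)) + 0.1

rows = pairwise_cosine_distance(p, q)      # 2 row vectors vs. 2 row vectors
cols = pairwise_cosine_distance(p.T, q.T)  # 5 column vectors vs. 5 column vectors
print(rows.shape)  # (2, 2)
print(cols.shape)  # (5, 5)
```

Transposing the inputs is all it takes to switch from the row-vector to the column-vector interpretation.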
_cosine_distance looks good
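The function _cosine_distance is correctly named and correctly implemented for all cases where a in N^{n x l} and b in N^{m x l}, i.e. two matrices whose rows are vectors of the same length l. The result is an n x m matrix of pairwise distances.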
To use _cosine_distance for 1D arrays, simply add a singleton dimension at axis 0, e.g. _cosine_distance(a[np.newaxis], b[np.newaxis]).
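A short sketch of the 1D case. The function is copied from the question with its indentation restored; the vectors u and v are made up for illustration.

```python
import numpy as np

def _cosine_distance(a, b, data_is_normalized=False):
    if not data_is_normalized:
        a = np.asarray(a) / np.linalg.norm(a, axis=1, keepdims=True)
        b = np.asarray(b) / np.linalg.norm(b, axis=1, keepdims=True)
    return 1. - np.dot(a, b.T)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])  # orthogonal to u, so the distance should be 1

# u[np.newaxis] has shape (1, 2): a single row vector.
d = _cosine_distance(u[np.newaxis], v[np.newaxis])
print(d.shape)  # (1, 1)
print(d[0, 0])  # 1.0
```

Passing u and v directly would raise an error, because axis=1 does not exist for a 1D array.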
findCosineSimilarity looks bad
findCosineSimilarity is incorrectly named (it calculates the cosine distance, not the similarity), and the implementation only works for one-dimensional arrays. For anything else it computes something that does not match the definition of cosine distance: np.sum adds up the squares of all entries, so the denominator uses the norm of the whole matrix instead of the norms of the individual vectors. Also, transposing source_representation (the left matrix) hints that the function is meant for column vectors, which differs from _cosine_distance; not that findCosineSimilarity would work for matrices anyway.
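As a rough illustration of where the 1D and 2D cases diverge, the sketch below uses the function from the question with made-up inputs u and m.

```python
import numpy as np

def findCosineSimilarity(source_representation, test_representation):
    a = np.matmul(np.transpose(source_representation), test_representation)
    b = np.sum(np.multiply(source_representation, source_representation))
    c = np.sum(np.multiply(test_representation, test_representation))
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

# 1D input: parallel vectors, so the cosine distance should be 0.
u = np.array([3.0, 4.0])
d1 = findCosineSimilarity(u, 2 * u)
print(np.isclose(d1, 0.0))  # True

# 2D input: sqrt(b) is the norm of the whole matrix, not of a single vector.
m = np.array([[3.0, 4.0],
              [4.0, 3.0]])
whole_matrix_norm = np.sqrt(np.sum(m * m))
print(np.isclose(whole_matrix_norm, np.linalg.norm(m)))  # True

# Each vector compared with itself should give 0 on the diagonal, but does not.
d2 = findCosineSimilarity(m, m)
print(np.allclose(np.diag(d2), 0.0))  # False
```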
It is easy to create a test case that does not depend on the column/row vector convention by using an n x n matrix of ones, here with n = 4:
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
If we calculate the pairwise cosine distance for every vector in the matrix we should get all zeros as the vectors are the same.
import numpy as np

def findCosineSimilarity(source_representation, test_representation):
    # Dot product(s) between the two inputs
    a = np.matmul(np.transpose(source_representation), test_representation)
    # Sum of squares over ALL entries (only a vector norm for 1D input)
    b = np.sum(np.multiply(source_representation, source_representation))
    c = np.sum(np.multiply(test_representation, test_representation))
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

def _cosine_distance(a, b, data_is_normalized=False):
    if not data_is_normalized:
        # Normalize every row vector to unit length
        a = np.asarray(a) / np.linalg.norm(a, axis=1, keepdims=True)
        b = np.asarray(b) / np.linalg.norm(b, axis=1, keepdims=True)
    return 1. - np.dot(a, b.T)

a = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1]
])

print(findCosineSimilarity(a, a))
print(_cosine_distance(a, a))
Output:
[[0.75 0.75 0.75 0.75]
 [0.75 0.75 0.75 0.75]
 [0.75 0.75 0.75 0.75]
 [0.75 0.75 0.75 0.75]]
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
We see that findCosineSimilarity fails and that _cosine_distance is correct.
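If you want a check that does not rely on identical vectors, one option is to compare _cosine_distance against a plain double loop that applies the textbook formula to each pair of rows. This is only a sketch: cosine_distance_loop and the random inputs x and y are hypothetical and not part of the question.

```python
import numpy as np

def _cosine_distance(a, b, data_is_normalized=False):
    if not data_is_normalized:
        a = np.asarray(a) / np.linalg.norm(a, axis=1, keepdims=True)
        b = np.asarray(b) / np.linalg.norm(b, axis=1, keepdims=True)
    return 1. - np.dot(a, b.T)

def cosine_distance_loop(a, b):
    """Hypothetical reference: 1 - (u . v) / (|u| |v|) for each pair of rows."""
    out = np.empty((len(a), len(b)))
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            out[i, j] = 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return out

rng = np.random.default_rng(42)
x = rng.random((3, 5)) + 0.1  # 3 row vectors of length 5
y = rng.random((2, 5)) + 0.1  # 2 row vectors of length 5

fast = _cosine_distance(x, y)
slow = cosine_distance_loop(x, y)
print(fast.shape)               # (3, 2)
print(np.allclose(fast, slow))  # True
```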