I'm using the distance.cosine function from Python's scipy.spatial package. The problem is that my code returns some values greater than one. How is that possible?

My code is very simple:

import numpy as np
from scipy.spatial import distance

# vec, embedding and matrix are defined elsewhere
words = vec.split(",")
for i, w1 in enumerate(words):
    v1 = embedding.get_phrase_vector(w1)
    vec_1 = v1 / np.linalg.norm(v1)  # normalize to unit length
    for j, w2 in enumerate(words):
        v2 = embedding.get_phrase_vector(w2)
        vec_2 = v2 / np.linalg.norm(v2)
        matrix[i][j] = distance.cosine(vec_1, vec_2)

The two vectors giving me problems are:

w1=[-0.29137    1.0635    -0.41772    0.10439    0.46724    0.28249
 -0.04234   -0.07716    0.31482   -0.31903   -0.15905    0.98593
  0.40408   -0.33376    0.11372    0.3485     0.28884    0.082693
  0.86843   -0.40946   -0.64101   -0.55062    0.15105   -0.16613
  0.88421    0.31586    0.0017234 -0.46789   -0.48933   -0.38975
 -0.48061   -0.086691   0.96367    0.13027    0.10883    0.13111
 -0.28605    0.32731    0.10249   -0.50631   -0.27578    0.053391
  0.45665   -0.11782    0.039271   0.27073    0.46305    0.66542
 -0.41682   -0.14791   -0.9136    -0.71694   -0.11963    0.095209
  0.21016    0.67604   -0.23403   -0.39308    0.34853   -0.91753
  0.73017    0.79334   -0.25474    0.51577   -1.0458    -0.59653
 -0.54101   -0.056912   0.01262    0.046881   0.0708     0.20313
 -0.34206   -0.62316   -0.48464    0.013741   0.057855  -0.29289
 -0.1755     0.059357  -0.01446    0.17238    0.065214   0.4437
  0.38186   -0.21588    0.55824    0.099175  -0.0094545  0.82726
 -0.4048    -0.47035   -0.16345    0.080469  -0.048781   0.091551
  0.67828   -0.56955   -0.024643  -0.51526  ]
w2=[-1.6486e-01  9.1997e-01  2.2737e-01 -4.9031e-01 -1.8082e-03 -3.3803e-01
  5.7221e-02  1.4601e-01  4.0202e-01 -2.8858e-01 -4.7495e-01 -5.6369e-01
  2.7037e-01  5.1702e-01 -1.1241e-01  1.8314e-01  2.2066e-01 -4.8606e-01
 -8.7284e-01 -6.2587e-02  4.3016e-02  2.3641e-01  5.9705e-01 -3.8640e-01
 -2.5194e-01  9.6862e-01 -4.3112e-01 -4.8370e-01 -1.1396e+00  9.2425e-02
 -1.1476e-01 -7.4291e-02 -6.2524e-02 -9.5122e-02 -2.2714e-01  8.8291e-01
  3.9978e-01  7.6631e-01 -6.7697e-01 -6.2829e-01 -1.1872e-01 -2.4492e-01
 -5.8893e-01 -8.5088e-01  1.1107e+00  4.2190e-01 -1.5072e+00 -1.9509e-01
 -2.6712e-01 -7.0801e-01  5.5075e-01 -4.6929e-02 -2.5203e-01  7.4411e-01
 -1.8325e-01 -1.4885e+00 -4.6393e-01 -1.0338e-01  2.3525e+00 -1.5421e-01
  3.9833e-01  1.5344e-02  8.0708e-02 -2.7373e-01  9.7057e-01 -1.9383e-02
  2.0899e-01 -6.4033e-01  9.2509e-01 -4.5371e-01 -7.0564e-01 -1.6033e-01
 -7.1761e-02  6.2856e-01  3.5732e-01  8.8802e-01 -6.9127e-01  4.9634e-02
 -9.3347e-01  6.5396e-01  3.7165e-01  5.8363e-02 -1.0152e+00  7.0845e-01
 -1.3542e+00 -3.6390e-01  2.5994e-01 -1.8260e-01 -9.8930e-01 -4.4699e-01
  8.5016e-01  9.4532e-02  3.7019e-01 -5.0354e-01 -1.2083e+00 -3.5776e-01
  2.3899e-01 -6.7904e-02  1.5072e+00  6.0889e-01]

and their distance comes out as 1.08074426763993081

Barbamento

1 Answer

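If the dot product of these vectors is negative, it's perfectly fine for cosine to return a value greater than 1 (see the formula used for cosine in the documentation). scipy's cosine is a distance, not a similarity: it computes 1 - u·v / (||u|| ||v||), so it ranges from 0 (same direction) to 2 (opposite directions), and anything above 1 just means the angle between the vectors is greater than 90 degrees.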

For example:

from scipy.spatial.distance import cosine

cosine([1], [-1])

Output:

2.0
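
To make the range concrete, here is a minimal sketch that writes the same formula out by hand and checks it against scipy. The helper name cosine_distance and the small 2-D vectors are made up for illustration; they are not the vectors from the question.

```python
import numpy as np
from scipy.spatial.distance import cosine

def cosine_distance(u, v):
    """Hypothetical helper: 1 - cos(angle between u and v), written out by hand."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Illustrative vectors only (assumed for this sketch)
same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # 90 degrees apart
obtuse = cosine_distance([1.0, 0.0], [-1.0, 1.0])     # 135 degrees apart
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction

print(same, orthogonal, obtuse, opposite)

# The hand-written version should agree with scipy's cosine
print(np.isclose(obtuse, cosine([1.0, 0.0], [-1.0, 1.0])))
```

If you want a similarity in [-1, 1] instead of a distance, 1 - cosine(u, v) gives it back. Normalizing the vectors beforehand, as in the question, does not change the result, because the formula already divides by both norms.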
perl