I have a symmetric co-occurrence matrix (1877 x 1877). I treat the columns as features and compute the cosine distance between them. Before that, I standardize the matrix (center each column to zero mean and scale it to unit variance).

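from sklearn import preprocessing
from sklearn.metrics import pairwise_distances

# Standardize each column to zero mean and unit variance
X_scaled = preprocessing.scale(mymatrix)
# Pairwise cosine distances between the rows of X_scaled
dist = pairwise_distances(X_scaled, metric="cosine")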

My questions:

  1. Should I scale the co-occurrence data before computing the cosine distance/similarity? The figure below shows a histogram of the original matrix: the x-axis is the co-occurrence value and the y-axis is the number of times that value appears in the matrix. [Figure: histogram of co-occurrence values]
  2. The code above returns distances greater than 1 and less than 0. How can I ensure that the cosine distance values lie between 0 and 1? Should I apply a min-max scaler to the dist matrix?
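
To make the second question easier to reproduce, here is a minimal sketch on a toy matrix. `toy_matrix` is a made-up stand-in for `mymatrix` (random Poisson counts, symmetrized), and dividing the distance by 2 is only one possible way to map it into [0, 1]; it is an assumption for illustration, not something validated on co-occurrence data.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in for mymatrix: a small symmetric, non-negative count matrix.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.0, size=(50, 50))
toy_matrix = (counts + counts.T).astype(float)

# Raw counts are non-negative, so cosine similarity cannot be negative
# and cosine distance (1 - similarity) stays within [0, 1].
dist_raw = pairwise_distances(toy_matrix, metric="cosine")

# Centering introduces negative entries, so similarity can become negative
# and the distance can then exceed 1 (its upper bound is 2).
X_scaled = preprocessing.scale(toy_matrix)
dist_scaled = pairwise_distances(X_scaled, metric="cosine")

# One possible rescaling (an assumption): halve the distance so it fits in [0, 1].
dist_halved = dist_scaled / 2.0

print("raw:   ", dist_raw.min(), dist_raw.max())
print("scaled:", dist_scaled.min(), dist_scaled.max())
print("halved:", dist_halved.min(), dist_halved.max())
```

Unlike a min-max scaler applied to `dist`, halving does not depend on the particular values in the matrix, so a distance of 0 still means "same direction" and 1 means "opposite direction".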
kitchenprinzessin
