I have a symmetric co-occurrence matrix (1877 x 1877). I treat the columns as features and compute the cosine distances between them. Before that, I scale the matrix (center each column to zero mean and scale it to unit variance).
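from sklearn import preprocessing
from sklearn.metrics import pairwise_distances
# Standardize each column: zero mean, unit variance (axis=0 is the default)
X_scaled = preprocessing.scale(mymatrix)
# Cosine distance (1 - cosine similarity) between every pair of rows
dist = pairwise_distances(X_scaled, metric="cosine")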
My questions:
- Should I scale the co-occurrence data before computing the cosine distance/similarity? The figure below shows the histograms of the actual matrix. The x-axis shows the co-occurrence values in the matrix, and the y-axis shows how many times each value appears.
- The code above returns distances greater than 1 and less than 0. How can I ensure that the cosine distance values lie between 0 and 1? Should I apply a min-max scaler to the dist matrix?
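
For reference, here is a minimal sketch of how the range of the cosine distance changes with scaling. It uses synthetic data: the Poisson counts, the 50 x 50 size, and the names `counts`, `raw_dist` and `scaled_dist` are assumptions standing in for `mymatrix`, not the actual data. Cosine distance is 1 minus cosine similarity, so on non-negative counts it stays within [0, 1], while centering introduces negative components and the range widens to [0, 2].

```python
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)

# Hypothetical stand-in for the co-occurrence matrix: symmetric, non-negative counts.
A = rng.poisson(lam=2.0, size=(50, 50))
counts = (A + A.T).astype(float)

# Cosine distance between rows of the raw, non-negative counts.
raw_dist = pairwise_distances(counts, metric="cosine")

# Cosine distance between rows after column-wise standardization.
scaled_dist = pairwise_distances(preprocessing.scale(counts), metric="cosine")

print("raw:   ", raw_dist.min(), raw_dist.max())
print("scaled:", scaled_dist.min(), scaled_dist.max())

# Guard against tiny floating-point excursions outside the valid range.
scaled_dist = np.clip(scaled_dist, 0.0, 2.0)

# One possible way to map the scaled distances into [0, 1]: halve them.
# This keeps the ordering of the distances, unlike a data-dependent min-max rescale.
scaled_dist_01 = scaled_dist / 2.0
```

Whether halving, clipping, or skipping the centering step is appropriate depends on what the distances are used for afterwards, so treat this as an illustration of the ranges rather than a recommendation.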