I computed the tf-idf of my documents based on terms. Then I applied LSA to reduce the dimensionality of the term space. 'similarity_dist' contains negative values (see table below). How can I compute a cosine distance in the range 0 to 1?

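import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Term counts; tokenize_and_stem and descriptions are defined elsewhere (not shown).
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, tokenizer=tokenize_and_stem, stop_words='english')
%time tf = tf_vectorizer.fit_transform(descriptions)
print(tf.shape)
# LSA: truncated SVD to 100 components, then L2-normalize each row.
# tfidf_matrix_desc is the tf-idf matrix, built in code that is not shown here.
svd = TruncatedSVD(100)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
tfidf_desc = lsa.fit_transform(tfidf_matrix_desc)
explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))

# Pairwise cosine similarity between the LSA-reduced documents.
similarity_dist = cosine_similarity(tfidf_desc)
pd.DataFrame(similarity_dist, index=descriptions.index, columns=descriptions.index).head(10)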

print(tfidf_matrix_desc.min(),tfidf_matrix_desc.max())
#0.0 0.736443429828

print(tfidf_desc.min(),tfidf_desc.max())
#-0.518015429416 0.988306783341

print(similarity_dist.max(),similarity_dist.min())
#1.0 -0.272010919022

[Image: table showing the first 10 rows of similarity_dist]

kitchenprinzessin
  • Weird. Tf-idf values are expected to be non-negative, so the cosine should be between 0 and 1. We usually normalize to make the cosine easier to calculate, since it is just a dot product for a normalized matrix. Your code does not show the cosine call, and it uses SVD instead of LDA as you stated. Could you post the whole code you are actually using? – Rabbit May 27 '16 at 00:28
  • Sorry, I applied LSA, not LDA. I have updated the code. The values in 'tfidf_matrix_desc' are between 0 and 1, but tfidf_desc contains negative values (see the print statements). – kitchenprinzessin May 30 '16 at 02:01

1 Answer

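cosine_similarity is in the range -1 to 1. Raw tf-idf vectors are non-negative, so their cosine similarity stays between 0 and 1, but the SVD step in LSA produces components that can be negative, which is why negative similarities show up here.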

cosine distance is defined as:

cosine_distance = 1 - cosine_similarity 

Hence cosine_distance is in the range 0 to 2.
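
To make that range concrete, here is a minimal pure-Python sketch. The helper names `cosine_similarity_pair` and `cosine_distance_pair` are made up for illustration and are not sklearn functions; the vectors are toy values, not data from the question.

```python
import math

def cosine_similarity_pair(a, b):
    """Cosine of the angle between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance_pair(a, b):
    """Cosine distance as defined above: 1 - cosine similarity."""
    return 1.0 - cosine_similarity_pair(a, b)

# Same direction: similarity 1, so the distance is (numerically close to) 0.
same = cosine_distance_pair([1.0, 2.0], [1.0, 2.0])
# Orthogonal vectors: similarity 0, so the distance is 1.
orthogonal = cosine_distance_pair([1.0, 0.0], [0.0, 1.0])
# Opposite direction (needs negative components): similarity -1, distance close to 2.
opposite = cosine_distance_pair([1.0, 2.0], [-1.0, -2.0])

print(same, orthogonal, opposite)
```

Distances above 1 can only occur when some components are negative, which is the situation after the SVD step.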

See https://en.wikipedia.org/wiki/Cosine_similarity

Cosine distance is a term often used for the complement in positive space, that is: D_C(A,B) = 1 - S_C(A,B).

Note: if you must have it in the range of 0 to 1, you can use cosine_distance / 2
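
A rough sketch of that rescaling with NumPy follows. The function name `scaled_cosine_distance` is hypothetical, and the 2x2 matrix is a toy stand-in whose off-diagonal entry is the minimum similarity reported in the question.

```python
import numpy as np

def scaled_cosine_distance(similarity):
    """Map cosine similarities in [-1, 1] to distances in [0, 1]."""
    similarity = np.asarray(similarity, dtype=float)
    # Guard against tiny floating-point overshoots just outside [-1, 1].
    similarity = np.clip(similarity, -1.0, 1.0)
    # (1 - similarity) is in [0, 2]; halving it gives [0, 1].
    return (1.0 - similarity) / 2.0

# Toy similarity matrix (assumed values, not the asker's full data).
sim = np.array([[1.0, -0.272010919022],
                [-0.272010919022, 1.0]])
dist = scaled_cosine_distance(sim)
print(dist)
```

With the question's variables this would be called as `scaled_cosine_distance(similarity_dist)`. Note that after this rescaling, orthogonal vectors (similarity 0) get a distance of 0.5 rather than 1. sklearn also ships `sklearn.metrics.pairwise.cosine_distances`, which returns the unscaled 1 - similarity.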

Yaron
  • Can you please explain why the distance should be divided by 2? The cosine similarity between doc0 and doc0 (table above) is 1, so I think the sklearn method returns similarity in a positive space, or am I missing something? – kitchenprinzessin May 26 '16 at 09:33
  • cosine_similarity is defined as a value between -1 and 1, and cosine_distance is defined as 1 - cosine_similarity, hence the cosine_distance range is 0 to 2. – Yaron May 26 '16 at 09:50