I am trying to do hierarchical clustering in Python over a collection of documents. I used scipy.cluster.hierarchy with method='average' and metric='cosine' as below:

import fastcluster
from sklearn.metrics import pairwise_distances

distMatrix = pairwise_distances(X_normalized, metric='cosine')
L = fastcluster.linkage(distMatrix, method='average')

I have a problem interpreting the output of the linkage method, since some of the distances are greater than 1. How is that possible when the metric I am using is cosine? Isn't the cosine distance supposed to be less than or equal to 1?

[[ 7. 22. 0. 2. ]
 [ 14. 27. 0. 2. ]
 [ 33. 34. 0.266383 2. ]
 [ 2. 12. 0.77866776 2. ]
 [ 18. 20. 1.09118911 2. ]
 [ 0. 6. 1.09586741 2. ]
 [ 26. 30. 1.09711324 2. ]
 [ 32. 42. 1.12491309 3. ]
 [ 15. 16. 1.12715133 2. ]
 [ 5. 21. 1.18961564 2. ]
 [ 4. 8. 1.21144117 2. ]
 [ 3. 24. 1.21711052 2. ]
 [ 9. 17. 1.26018569 2. ]
 [ 1. 23. 1.27712536 2. ]
 [ 35. 41. 1.34423149 3. ]
 [ 13. 45. 1.36113739 3. ]
 [ 28. 46. 1.38535987 3. ]
 [ 29. 40. 1.40081718 3. ]
 [ 31. 44. 1.42614738 3. ]
 [ 25. 51. 1.42704815 4. ]
 [ 11. 50. 1.43200913 4. ]
 [ 10. 53. 1.44240297 4. ]
 [ 47. 54. 1.4833146 5. ]
 [ 19. 55. 1.48739052 5. ]
 [ 48. 52. 1.49125894 5. ]
 [ 49. 59. 1.50473572 7. ]
 [ 58. 60. 1.55300865 10. ]
 [ 57. 62. 1.56317408 14. ]
 [ 56. 61. 1.5656443 11. ]
 [ 63. 64. 1.58042986 25. ]]

Nima
  • What are the values in X_normalized? This will happen if the dot product of two feature vectors is negative, which can happen if some of the entries are negative. If you're normalizing to mean 0, sd 1, that will happen. You probably don't need to normalize at all because cosine distance already does that by dividing the dot product by the norms of the vectors. – Adam Acosta Nov 16 '15 at 21:34
  • Thank you Adam! You are right. I am in fact standardizing my data. – Nima Nov 17 '15 at 00:57
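
The explanation in the comments can be checked with a small sketch. Cosine distance is 1 - cos(theta), so it lies in [0, 2]; it stays at or below 1 only when every dot product is non-negative. The data below is a made-up stand-in for X_normalized (random counts, then standardized per column), and it uses scipy's pdist and linkage rather than the pairwise_distances and fastcluster calls from the question, so treat it as an illustration under those assumptions, not a reproduction of the original run.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cosine, pdist

# Two opposite vectors: cos(theta) = -1, so the cosine distance is 2.
opposite = cosine([1.0, 0.0], [-1.0, 0.0])

rng = np.random.default_rng(0)

# Hypothetical stand-in for document features: non-negative counts.
X_counts = rng.poisson(lam=2.0, size=(20, 50)).astype(float)

# Standardizing each column to mean 0, sd 1 introduces negative entries.
X_standardized = (X_counts - X_counts.mean(axis=0)) / X_counts.std(axis=0)

# pdist returns the condensed (upper-triangle) vector of pairwise distances.
d_counts = pdist(X_counts, metric="cosine")
d_standardized = pdist(X_standardized, metric="cosine")

print("max distance, raw counts:   ", d_counts.max())
print("max distance, standardized: ", d_standardized.max())

# Average linkage on the condensed distances; column 2 holds merge distances.
Z = linkage(d_standardized, method="average")
print("largest merge distance:     ", Z[:, 2].max())
```

With non-negative features the maximum stays at or below 1, while the standardized copy goes above 1 but never past 2. Note also that scipy documents the first argument of linkage as either a condensed distance matrix or an array of observation vectors, so a square matrix such as the one returned by pairwise_distances would normally be converted with scipy.spatial.distance.squareform first. Whether that also affects the fastcluster call in the question is worth checking against its documentation.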
