
I have a 50 by 50 matrix holding the pairwise correlations between 50 journals. Now I am trying to plot a graph showing which clusters those 50 journals fall into, based on this data.

1) I would prefer to use complete linkage or Ward's method for the clustering. 2) I am stuck on where to begin, as the scikit-learn documentation is too technical for me. 3) Could you please help me get started?

Thank you very much in advance...

All my data fall between -1 and 1, as they are correlation coefficients.

Example of Data Sample (50*50):

data = [[ 1.          0.49319094  0.58838586 ...,  0.11433441  0.6450184   0.60842821]
        [ 0.49319094  1.          0.39311674 ..., -0.00795401  0.42944597  0.68855177]
        [ 0.58838586  0.39311674  1.         ...,  0.39785574  0.864322    0.68910632]
        ...,
        [ 0.11433441 -0.00795401  0.39785574 ...,  1.          0.38623474  0.34228516]
        [ 0.6450184   0.42944597  0.864322   ...,  0.38623474  1.          0.65408474]
        [ 0.60842821  0.68855177  0.68910632 ...,  0.34228516  0.65408474  1.        ]]

Amitsd

1 Answer


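The clustering routines in Python (scikit-learn, SciPy) expect distances, i.e. low values mean "more similar". A correlation matrix is a similarity, so it has to be converted first.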

Ward's method is designed for squared Euclidean distances, so while it may run on correlation-based distances, the support from theory may be weak. Complete linkage only needs pairwise dissimilarities, so it is supported.

What about negative correlations - how do you want to treat them?

I know of three popular transformations (p being the correlation):

  1. 1 - p**2 (depending on the implementation, this may be a good choice with Ward because of the square)
  2. 1 - abs(p)
  3. 1 - p (This will treat negative correlations as bad, i.e. as large distances!)
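The three transformations above can be sketched in NumPy as follows. This is only an illustration: the helper name `correlation_to_distance`, its `method` labels, and the 3 by 3 toy matrix are made up for this example and are not part of any library.

```python
import numpy as np

def correlation_to_distance(corr, method="abs"):
    """Convert a correlation matrix into a dissimilarity matrix.

    method (hypothetical labels for the three options above):
      "square" -> 1 - p**2
      "abs"    -> 1 - abs(p)
      "signed" -> 1 - p   (negative correlations become large distances)
    """
    corr = np.asarray(corr, dtype=float)
    if method == "square":
        dist = 1.0 - corr ** 2
    elif method == "abs":
        dist = 1.0 - np.abs(corr)
    elif method == "signed":
        dist = 1.0 - corr
    else:
        raise ValueError("method must be 'square', 'abs' or 'signed'")
    # Guard against tiny negative values caused by floating-point rounding
    dist = np.clip(dist, 0.0, None)
    # A journal has distance zero to itself
    np.fill_diagonal(dist, 0.0)
    return dist

# Toy correlation matrix with made-up values (not the asker's data)
corr = np.array([[ 1.0, 0.5, -0.8],
                 [ 0.5, 1.0,  0.1],
                 [-0.8, 0.1,  1.0]])

d_abs = correlation_to_distance(corr, "abs")
d_signed = correlation_to_distance(corr, "signed")
print(d_abs)
print(d_signed)
```

Note how the pair with correlation -0.8 ends up close together under `"abs"` but far apart under `"signed"`; which behaviour is wanted depends on how negative correlations should be interpreted.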

Make sure to set your metric to precomputed. And get used to reading the sklearn documentation. It is one of the least technical you will find, so you had better become more technical yourself.
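As a rough starting point, complete linkage on a precomputed distance matrix might look like the sketch below. Several things here are assumptions: the data is a small synthetic stand-in for the 50 by 50 matrix, `n_clusters=2` is an arbitrary choice, and 1 - abs(p) is used as the transformation. Also, the keyword that accepts `"precomputed"` is called `metric` in recent scikit-learn releases and `affinity` in older ones, so the sketch checks which one is available.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the real data: 6 variables in two groups of 3,
# where each group follows its own latent signal plus a little noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.hstack([
    base[:, [0]] + 0.1 * rng.normal(size=(200, 3)),
    base[:, [1]] + 0.1 * rng.normal(size=(200, 3)),
])
corr = np.corrcoef(X, rowvar=False)  # 6 x 6 correlation matrix

# Convert similarity to distance (assumed choice: 1 - abs(p))
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)

# Pick the keyword name supported by the installed scikit-learn version
params = inspect.signature(AgglomerativeClustering.__init__).parameters
metric_key = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(
    n_clusters=2,            # arbitrary for this toy example
    linkage="complete",      # Ward would not accept a precomputed matrix
    **{metric_key: "precomputed"},
)
labels = model.fit_predict(dist)
print(labels)  # one cluster label per variable (per journal, for the real data)
```

For the real data, `dist` would be built from the 50 by 50 correlation matrix instead. To draw the cluster tree itself, `scipy.cluster.hierarchy.linkage` (fed the condensed form from `scipy.spatial.distance.squareform`) together with `scipy.cluster.hierarchy.dendrogram` is one option.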

Has QUIT--Anony-Mousse
  • Thank you very much for the response. Yes, I plan to use complete linkage for clustering. From what I have read in many experiments, I think 1-abs(p) will be the best way to deal with the negative correlations before clustering the data set. – Amitsd Oct 23 '16 at 08:15