
I'm trying to use SciPy's dendrogram function to cut my data into a number of clusters based on a threshold value. However, once I create a dendrogram and retrieve its color_list, there is one fewer entry in the list than there are labels.

Alternatively, I've tried using fcluster with the same threshold value I identified in the dendrogram; however, this does not produce the same result -- it gives me one cluster instead of three.

Here's my code:

import pandas
data = pandas.DataFrame({'total_runs': {0: 2.489857755536053,
1: 1.2877651950650333, 2: 0.8898850111727028, 3: 0.77750321282732704, 4: 0.72593099987615461, 5: 0.70064977003207007, 
6: 0.68217502514600825, 7: 0.67963194285399975, 8: 0.64238326692987524, 9: 0.6102581538587678, 10: 0.52588765899448564,
11: 0.44813665774322564, 12: 0.30434031343774476, 13: 0.26151929543260161, 14: 0.18623657993534984, 15: 0.17494230269731209,
16: 0.14023670906519603, 17: 0.096817318756050832, 18: 0.085822227670014059, 19: 0.042178447746868117, 20: -0.073494398270518693,
21: -0.13699665903273103, 22: -0.13733324345373216, 23: -0.31112299949731331, 24: -0.42369178918768974, 25: -0.54826542322710636,
26: -0.56090603814914863, 27: -0.63252372328438811, 28: -0.68787316140457322, 29: -1.1981351436422796, 30: -1.944118415387774,
31: -2.1899746357945964, 32: -2.9077222144449961},
'total_salaries': {0: 3.5998991340231234,
1: 1.6158435140488829, 2: 0.87501176080187315, 3: 0.57584734201367749, 4: 0.54559862861592978, 5: 0.85178295446270169,
6: 0.18345463930386757, 7: 0.81380836410678736, 8: 0.43412670908952178, 9: 0.29560433676606418, 10: 1.0636736398252848,
11: 0.08930130612600648, 12: -0.20839133305170349, 13: 0.33676911316165403, 14: -0.12404710480916628, 15: 0.82454221267393346,
16: -0.34510456295395986, 17: -0.17162157282367937, 18: -0.064803261585569982, 19: -0.22807757277294818, 20: -0.61709008778669083,
21: -0.42506873158089231, 22: -0.42637946918743924, 23: -0.53516500398181921, 24: -0.68219830809296633, 25: -1.0051418692474947,
26: -1.0900316082184143, 27: -0.82421065378673986, 28: 0.095758053930450004, 29: -0.91540963929213015, 30: -1.3296449323844519,
31: -1.5512503530547552, 32: -1.6573856443389405}})

from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

distanceMatrix = pdist(data)
dend = dendrogram(linkage(distanceMatrix, method='complete'),
                  color_threshold=4,
                  leaf_font_size=10,
                  labels=df.teamID.tolist())  # df is my original DataFrame, which holds the teamID column

[dendrogram plot: the 33 labeled leaves grouped into three clusters colored blue, green, and red]

len(dend['color_list'])
Out[169]: 32
len(df.index)
Out[170]: 33

Why does dendrogram assign colors to only 32 entries when there are 33 observations in the data? Is color_list how I extract the labels and their corresponding clusters (colored blue, green, and red above)? If not, how else do I 'cut' the tree properly?
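
(For reference, a minimal sketch of where the off-by-one comes from, reusing data from above: linkage over n observations produces n - 1 merge rows, and color_list holds one color per merge link, not one per leaf.)

from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

Z = linkage(pdist(data), method='complete')
print(Z.shape)  # (32, 4): 32 links for 33 observations, matching len(dend['color_list'])
# Recent SciPy versions also return dend['leaves_color_list'], which has one color per leaf.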

Here's my attempt at using fcluster. Why does it return only one cluster for the set, when the same threshold passed to dendrogram gives three?

from scipy.cluster.hierarchy import fcluster
fcluster(linkage(distanceMatrix, method='complete'), 4)
Out[175]: 
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
  • The docs state about `color_list`: `A list of color names. The k’th element represents the color of the k’th link.` See: http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html. I think this is a misunderstanding. – cel Feb 24 '15 at 07:08
  • Given this, how do I see the cluster results if I 'cut' the tree at a threshold of 4? – Bryan Feb 24 '15 at 16:52
  • Also, as this question has been asked on SO many times without a good answer (see my annotations below), I'm going to put it up for a bounty. The specific question is: using a dendrogram in SciPy, how do I cut my data into clusters given a specific threshold level, and then collect these clusters with their corresponding observation label? See http://stackoverflow.com/questions/9708630/some-questions-on-dendrogram-python-scipy?rq=1, and http://stackoverflow.com/questions/10305111/pruning-dendrogram-in-scipy-hierarchical-clustering?rq=1 – Bryan Feb 24 '15 at 16:58
  • Here's another almost exactly similar question without an answer: http://stackoverflow.com/questions/26608412/pruning-dendrogram-at-levels-in-scipy-hierarchical-clustering?rq=1 – Bryan Feb 25 '15 at 19:57

1 Answer


Here's the answer: I didn't pass 'distance' as the criterion option to fcluster. With it, I get the correct (three) cluster assignments.

assignments = fcluster(linkage(distanceMatrix, method='complete'), 4, criterion='distance')

print assignments
       [3 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

cluster_output = pandas.DataFrame({'team': df.teamID.tolist(), 'cluster': assignments})

print cluster_output
    cluster team
0         3  NYA
1         2  BOS
2         2  PHI
3         2  CHA
4         2  SFN
5         2  LAN
6         2  TEX
7         2  ATL
8         2  SLN
9         2  SEA
10        2  NYN
11        2  HOU
12        1  BAL
13        2  DET
14        1  ARI
15        2  CHN
16        1  CLE
17        1  CIN
18        1  TOR
19        1  COL
20        1  OAK
21        1  MIL
22        1  MIN
23        1  SDN
24        1  KCA
25        1  TBA
26        1  FLO
27        1  PIT
28        1  LAA
29        1  WAS
30        1  ANA
31        1  MON
32        1  MIA
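
For completeness, a short sketch (reusing distanceMatrix from the question; the Counter import is only for a quick tally) of why the original call collapsed everything into one cluster: fcluster's default criterion is 'inconsistent', so fcluster(Z, 4) thresholds the inconsistency statistic rather than the merge distances, whereas criterion='distance' cuts on the same cophenetic heights that dendrogram's color_threshold uses.

from collections import Counter
from scipy.cluster.hierarchy import fcluster, linkage

Z = linkage(distanceMatrix, method='complete')

print(fcluster(Z, 4))                              # default criterion='inconsistent' -> all ones
assignments = fcluster(Z, 4, criterion='distance')
print(Counter(assignments))                        # Counter({1: 19, 2: 13, 3: 1}) -> three clusters

Because both thresholds are measured on the same scale (the merge heights stored in Z), the value 4 that separates three colored branches in the dendrogram also yields three flat clusters here.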
  • This problem has been bothering me too. I finally worked out that fcluster returns assignments in the order of the original observations (the input distance matrix), not in the left-to-right leaf order of the dendrogram; see the sketch after these comments. – user3479780 Jul 25 '16 at 05:44
  • The criterion parameter is very important; different criterion values apply different rules for forming the flat clusters. – Statham Dec 13 '18 at 13:25
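
Following up on the ordering comment above, a minimal sketch (reusing dend and assignments from earlier) that reindexes the fcluster output into the dendrogram's left-to-right leaf order; dend['leaves'] holds the original observation indices and dend['ivl'] the corresponding labels:

import numpy

leaf_order = dend['leaves']                               # observation indices, left to right in the plot
assignments_in_plot_order = numpy.asarray(assignments)[leaf_order]
print(list(zip(dend['ivl'], assignments_in_plot_order)))  # (teamID, cluster) pairs in plot order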