I am doing hierarchical clustering with scipy.cluster, followed by fcluster at different cutoffs. I also want to use scikit-learn's silhouette_score. I have seen the post "How to calculate Silhouette Score of the scipy's fcluster using scikit-learn silhouette score?", but I get the error "too many boolean indices".

My code is as follows:

import numpy as np
import fastcluster
from sklearn import metrics
from scipy.cluster import hierarchy as hac


Temps=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
Distances=[]
# read the condensed distances into the list Distances, then
Distances=np.array(Distances)
Z=fastcluster.linkage(Distances, "complete", "euclidean")
for Cutoff in Temps:
    results=hac.fcluster(Z,Cutoff,'distance')
    metrics.silhouette_score(Distances, results, metric="euclidean")

The error report was:

Traceback (most recent call last):
  File "Clustering_2.py", line 93, in <module>
    main(argv)
  File "Clustering_2.py", line 69, in main
    silscore=metrics.silhouette_score(Distances, results,metric='euclidean')
  File "/home/wangz18/site-packages2/sklearn/metrics/cluster/unsupervised.py", line 93, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/home/wangz18/site-packages2/sklearn/metrics/cluster/unsupervised.py", line 157, in silhouette_samples
    for i in range(n)])
  File "/home/wangz18/site-packages2/sklearn/metrics/cluster/unsupervised.py", line 187, in _intra_cluster_distance
    a = np.mean(distances_row[mask])
ValueError: too many boolean indices

What is the problem? Please advise. Thanks.

sheldonzy
user1830108
  • this code sample is incomplete, can you add the imports for `fastcluster` and `hac` and the definition of `Distances` and `Cutoff`? – maxymoo May 11 '16 at 23:59
  • also add the traceback for too many boolean indices, please! – joeln May 12 '16 at 03:46
  • The changes have been made as asked. Thanks! – user1830108 May 17 '16 at 15:51
  • Are you sure you have the right input to `silhouette_score`? According to the [docs](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), the first argument X needs to be *X : array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise* . Which I take to mean that unless you set `metric` to “precomputed” it expects a matrix of features, not distances. – patrick May 17 '16 at 16:05
  • Distances is a condensed distance matrix – user1830108 May 17 '16 at 16:34
  • Yes, that's what I mean. If you look at the docs, I don't think it expects you to put in a distance matrix but your original feature matrix: [n_samples_a, n_features] – patrick May 17 '16 at 19:19
  • Thanks! I think I should do one more step, `Dicts = squareform(Distances)`, and then use `Dicts` for `metrics.silhouette_score` – user1830108 May 17 '16 at 20:14
  • I'm not familiar with what squareform does, but it looks like that will give you yet another distance matrix. My understanding would be that just changing your code to `metrics.silhouette_score(Distances, results, metric="precomputed")` should get rid of the error. Cf. docs: *If X is the distance array itself, use metric="precomputed"* – patrick May 18 '16 at 13:09

1 Answer

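I had the same problem. Please check the following:

  1. Distances is an N*N square matrix, where N is the number of samples. A condensed distance matrix has to be converted first (for example with squareform) and passed with metric="precomputed".

  2. results has length N, and each value is the cluster label of the corresponding sample.

  3. The number of clusters is greater than 1.

If #1 and #2 hold, then the call should work.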
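The checks above can be sketched roughly as follows. This is only an illustration under some assumptions: the random `points` array is hypothetical stand-in data, `scipy.cluster.hierarchy.linkage` is used in place of `fastcluster.linkage`, and the `scores` dictionary is just a convenient way to collect the results. The upper bound of N - 1 clusters is a requirement of scikit-learn's `silhouette_score`, in addition to the lower bound in #3.

```python
import numpy as np
from scipy.cluster import hierarchy as hac
from scipy.spatial.distance import pdist, squareform
from sklearn import metrics

# Hypothetical data: 30 random 2-D points standing in for the real samples.
rng = np.random.default_rng(0)
points = rng.random((30, 2))
N = points.shape[0]

# Condensed distance matrix of length N*(N-1)/2, as in the question.
Distances = pdist(points, metric="euclidean")

# Assumption: scipy's linkage replaces fastcluster.linkage here.
Z = hac.linkage(Distances, method="complete")

# Check 1: convert the condensed vector to a square N x N matrix.
square = squareform(Distances)

Temps = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
scores = {}
for Cutoff in Temps:
    # Check 2: fcluster returns one cluster label per sample (length N).
    results = hac.fcluster(Z, Cutoff, criterion="distance")
    n_clusters = len(np.unique(results))
    # Check 3: the score is only defined for 2 <= n_clusters <= N - 1.
    if 2 <= n_clusters <= N - 1:
        scores[Cutoff] = metrics.silhouette_score(
            square, results, metric="precomputed"
        )
    else:
        scores[Cutoff] = None  # score undefined for this cutoff
```

Cutoffs that put every sample in its own cluster, or all samples in one cluster, are stored as `None` instead of raising an error.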

Lilly Wu