Calculate the silhouette score of SciPy's fcluster output using scikit-learn's silhouette_score

Question

I am doing hierarchical clustering with scipy.cluster, followed by fcluster at different cutoffs. I also want to use scikit-learn's silhouette_score. I saw the post "How to calculate Silhouette Score of the scipy's fcluster using scikit-learn silhouette score?" However, I get the error "too many boolean indices".

My code is as follows:

import numpy as np
import fastcluster
from sklearn import metrics
from scipy.cluster import hierarchy as hac


Temps = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
Distances = []
# Read the previously computed distances into the list, then convert to an array
Distances = np.array(Distances)
Z = fastcluster.linkage(Distances, "complete", "euclidean")
for Cutoff in Temps:
    # Flat cluster labels at this distance cutoff
    results = hac.fcluster(Z, Cutoff, 'distance')
    # This is the call that raises the error shown below
    silscore = metrics.silhouette_score(Distances, results, metric="euclidean")

The error report was:

Traceback (most recent call last):
  File "Clustering_2.py", line 93, in <module>
    main(argv)
  File "Clustering_2.py", line 69, in main
    silscore=metrics.silhouette_score(Distances, results,metric='euclidean')
  File "/home/wangz18/site-packages2/sklearn/metrics/cluster/unsupervised.py", line 93, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/home/wangz18/site-packages2/sklearn/metrics/cluster/unsupervised.py", line 157, in silhouette_samples
    for i in range(n)])
  File "/home/wangz18/site-packages2/sklearn/metrics/cluster/unsupervised.py", line 187, in _intra_cluster_distance
    a = np.mean(distances_row[mask])
ValueError: too many boolean indices

What is the problem? Please advise. Thanks.

this code sample is incomplete, can you add the imports for `fastcluster` and `hac` and the definition of `Distances` and `Cutoff`? — maxymoo, May 11 '16 at 23:59
also add the traceback for too many boolean indices, please! — joeln, May 12 '16 at 03:46
Are you sure you have the right input to `silhouette_score`? According to the [docs](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), the first argument X needs to be *X : array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise* . Which I take to mean that unless you set `metric` to “precomputed” it expects a matrix of features, not distances. — patrick, May 17 '16 at 16:05
Yes, that's what I mean. If you look at the docs, I don't think it expects a distance matrix but your original feature matrix: [n_samples_a, n_features] — patrick, May 17 '16 at 19:19
Thanks! I think I should do one more step, `Dicts = squareform(Distances)`, and then use `Dicts` for `metrics.silhouette_score` — user1830108, May 17 '16 at 20:14
I'm not familiar with what squareform does, but it looks like that will give you yet another distance matrix. My understanding is that just changing your code to `metrics.silhouette_score(Distances, results, metric="precomputed")` should get rid of the error. Cf. the docs: *If X is the distance array itself, use metric="precomputed"* — patrick, May 18 '16 at 13:09
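
For readers unsure what `squareform` does in the comments above: `scipy.spatial.distance.squareform` converts between the condensed 1-D distance vector (the form `linkage` accepts) and the full symmetric N x N matrix. A minimal sketch follows. The three distance values are made up for illustration, and it is only an assumption that the asker's `Distances` was in condensed form.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical condensed distances for 3 samples, in the order
# d(0,1), d(0,2), d(1,2). These values are illustrative only.
condensed = np.array([1.0, 2.0, 3.0])

# Condensed vector -> symmetric 3 x 3 matrix with a zero diagonal.
square = squareform(condensed)

# Applying squareform to a square matrix converts it back to condensed form.
back = squareform(square)
```

If `Distances` really is condensed, then the square matrix is what `silhouette_score` would need together with `metric="precomputed"`.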

score 0 · Answer 1 · answered Mar 14 '19 at 16:15

I had the same question. Please check the following:

1. Distances is an N x N matrix, where N is the number of samples.
2. results has length N, and each value is the cluster label of the corresponding sample.
3. The number of clusters is greater than 1.

If #1 and #2 are correct, then it should work.
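
The checks above can be sketched roughly as follows. This is an illustration under assumptions, not the asker's actual pipeline: the `points` array is made-up data, `scipy.cluster.hierarchy.linkage` stands in for `fastcluster.linkage`, and the cutoff values are arbitrary.

```python
import numpy as np
from scipy.cluster import hierarchy as hac
from scipy.spatial.distance import pdist, squareform
from sklearn import metrics

# Hypothetical data: two well-separated groups of 2-D points.
points = np.array([[0.0, 0.0], [0.0, 0.2], [0.3, 0.1], [0.2, 0.4],
                   [5.0, 5.0], [5.1, 5.3], [5.4, 5.0], [5.2, 5.2]])

condensed = pdist(points, metric="euclidean")  # 1-D, length N*(N-1)/2
square = squareform(condensed)                 # check 1: N x N matrix

# linkage accepts the condensed distances directly.
Z = hac.linkage(condensed, method="complete")

scores = {}
for cutoff in [0.25, 1.0, 10.0]:
    labels = hac.fcluster(Z, cutoff, criterion="distance")  # check 2: length N
    n_clusters = len(np.unique(labels))
    # Check 3: silhouette_score needs 2 <= n_clusters <= n_samples - 1.
    if 2 <= n_clusters <= len(labels) - 1:
        scores[cutoff] = metrics.silhouette_score(
            square, labels, metric="precomputed")
    else:
        scores[cutoff] = None  # score is undefined for this cutoff
```

Skipping cutoffs that give a single cluster (or one cluster per sample) avoids the `ValueError` that `silhouette_score` raises when the number of labels is out of range.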

Lilly Wu
