
I compared the silhouette widths of different clustering algorithms (k-means, CLARA, and PAM) on the same dataset. I can see which one scores highest on silhouette width, but can I now statistically test whether the solutions differ from each other, roughly the way we normally do with ANOVA?

For my thesis I formulated the hypothesis that CLARA and PAM would give more valid results than k-means. I know the silhouette width of both of them is higher, but I don't know how to statistically confirm or disconfirm my hypothesis.

####### 4: Behavioral Clustering
## 4.1 k-means
kmeans.res.4.1 <- kmeans(ClusterDFSBeha, 2)
print(kmeans.res.4.1)
# Calculate SW
library(clValid)
intern4.1 <- clValid(ClusterDFSBeha, 2, clMethods = "kmeans", validation = "internal", maxitems = 9800)
summary(intern4.1)
# Silhouette width = 0.7861

## 4.2 PAM
library(cluster)   # pam() and clara() come from the cluster package
pam.res.4.2 <- pam(ClusterDFSBeha, 2)
print(pam.res.4.2)
intern4.2 <- clValid(ClusterDFSBeha, 2, clMethods = "pam", validation = "internal", maxitems = 9800)
summary(intern4.2)
# Silhouette width = 0.6702

## 4.3 CLARA
clara.res.4.3 <- clara(ClusterDFSBeha, 2)
print(clara.res.4.3)
intern4.3 <- clValid(ClusterDFSBeha, 2, clMethods = "clara", validation = "internal", maxitems = 9800)
summary(intern4.3)
# Silhouette width = 0.8756

Now I would like to statistically assess whether the methods differ from each other, so that I can reject or retain my hypothesis at a given p level.

3 Answers


It is not a perfect answer.

If you want to test the "quality" of a clustering method, the better approach is to look at the partition produced by the algorithm.

For that check you can compare partitions through a measure like the ARI (Adjusted Rand Index); we call that relative performance. Another idea is to use simulated data where you know the true labels, so you can compare your result against them and see how far you are from the truth. The last one I know of is to assess the stability of your clustering method under small perturbations of the data: the gap statistic of Rob Tibshirani.
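For example, a minimal sketch of the relative comparison via the ARI, assuming the kmeans.res.4.1 and pam.res.4.2 objects fitted in the question and the adjustedRandIndex() function from the mclust package:

# sketch: Adjusted Rand Index between two partitions of the same data
# (assumes the fitted objects from the question; adjustedRandIndex() is in mclust)
library(mclust)

km_labels  <- kmeans.res.4.1$cluster       # k-means assignments
pam_labels <- pam.res.4.2$clustering       # PAM assignments

# 1 = identical partitions, values near 0 = chance-level agreement
adjustedRandIndex(km_labels, pam_labels)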

But in fact, in clustering theory (unsupervised classification) it is really hard to evaluate the pertinence of a clustering. We have fewer model selection criteria than for supervised learning tasks.

I really advise you to look around online; for instance, this package vignette seems to be a good introduction:

https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf

To answer directly: I don't think that what you are looking for exists. If it does, I would be really happy to learn more about it.

Rémi Coulaud

Such a comparison will never be fair.

Any such test makes assumptions, and a clustering method that is based on similar assumptions can be expected to score better.

For example, if you evaluate with the Silhouette on Euclidean distances and compare PAM (with Euclidean distance) against k-means, PAM must be expected to have an advantage. If you used the Silhouette with squared Euclidean distances instead, k-means is almost certainly going to fare best (and it is also almost certain to outperform PAM run on squared Euclidean distances).

So you aren't judging which method is "better", but which correlates more with your evaluation method.
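A minimal sketch of that effect, assuming the ClusterDFSBeha data from the question: the very same k-means partition gets a different silhouette width depending on whether plain or squared Euclidean distances are fed to the evaluation.

library(cluster)

d_euc <- dist(ClusterDFSBeha)            # Euclidean distances
d_sq  <- as.dist(as.matrix(d_euc)^2)     # squared Euclidean distances

km <- kmeans(ClusterDFSBeha, 2)

# average silhouette width of the identical partition under the two distances
mean(silhouette(km$cluster, d_euc)[, "sil_width"])
mean(silhouette(km$cluster, d_sq)[, "sil_width"])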

Has QUIT--Anony-Mousse
  • Thank you so much, I understand. Do you have any idea how I could compare them on internal validity without being unfair? Should I maybe use the Manhattan distance so that kmeans, pam or clara does not have an advantage? If so, how should I change this setting in the clValid package? This is the one I used to calculate the silhouette width with. – Irma Kootstra May 02 '19 at 21:24
  • Or should I just mention this as a limitation of my research? Since in the clValid package article they also compare kmeans, pam and hierarchical clustering (http://www.sthda.com/english/wiki/print.php?id=243). There they also don’t specify this risk of correlation between the algorithm and the measure. Not sure what to do now, I do understand the issue however. So far thanks for your help! – Irma Kootstra May 02 '19 at 21:25
  • Manhattan is not independent of Euclidean, and that isn't really what you want. K-medians will likely win over the other with Manhattan. So you'd then measure which method is most like kmedians... What you need to embrace is that there is not *one* best answer. Every measure and every method has a different preference. Think of them as colors, is yellow better than green because it is more red? – Has QUIT--Anony-Mousse May 02 '19 at 23:34
  • Hmm yeah exactly, thanks. I’ll just need to explain it carefully indeed. That there is no one best measure to compare them on and why I choose this one. Thanks for your help, I really appreciate that! – Irma Kootstra May 03 '19 at 11:16

There is a simple way using contingency tables. Say you get one set of cluster assignments ll and another cc. In the ideal situation those labels align perfectly, and from the cross-tabulation you can produce a statistic with the chi-squared test, together with a p-value for the significance of the allocation differences:

ll1 = rep(c(4,3,2,1),100)
cc1 = rep(c(1:4),length(ll1)/4)
table(cc1, ll1)
print(paste("chi statistic=",chisq.test(cc1, ll1)$statistic ))
print(paste("chi pvalue=",chisq.test(cc1, ll1)$p.value ))

producing:

 ll1
cc1   1   2   3   4
  1   0   0   0 100
  2   0   0 100   0
  3   0 100   0   0
  4 100   0   0   0
[1] "chi statistic= 1200"
[1] "chi pvalue= 1.21264177763119e-252"

meaning that the cell counts are not randomly (uniformly) allocated, which supports an association between the two assignments. For a random allocation:

ll2 = sample(c(4,3,2,1),100,replace=TRUE)
cc2 = sample(c(1:4),length(ll2),replace=TRUE)
table(cc2, ll2)
print(paste("chi statistic=",chisq.test(cc2, ll2)$statistic ))
print(paste("chi pvalue=",chisq.test(cc2, ll2)$p.value ))

with outputs

   ll2
cc2  1  2  3  4
  1  6  7  6 10
  2  5  5  7  9
  3  6  7  7  4
  4  4  8  5  4
[1] "chi statistic= 4.96291083483202"
[1] "chi pvalue= 0.837529350518186"

supporting that there is no association.

You can use this on your cluster assignments from different algorithms to see whether they are randomly associated or not.
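For instance (a sketch only, reusing the fitted objects from the question), cross-tabulating the k-means and PAM assignments:

# sketch: test whether the k-means and PAM partitions are associated
km_lab  <- kmeans.res.4.1$cluster
pam_lab <- pam.res.4.2$clustering

table(km_lab, pam_lab)
chisq.test(km_lab, pam_lab)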

You can also use the Variation of Information distance between clusterings to get a distance between the assignments; for ll1 and cc1 (vi.dist() in the 'mcclust' R package):

library(mcclust)
vi.dist(ll1, cc1)
vi.dist(ll1, cc1, parts = TRUE)

you get

0

vi = 0, H(1|2) = 0, H(2|1) = 0

and for the sampled ll2 and cc2

vi.dist(ll2, cc2)
vi.dist(ll2, cc2, parts = TRUE)
3.68438190593985

vi = 3.68438190593985, H(1|2) = 1.84631473075115, H(2|1) = 1.83806717518869

There's also the V-measure you can apply
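As a rough illustration, here is a base-R sketch of the V-measure (the harmonic mean of homogeneity and completeness); the v_measure() function below is my own helper, not from a package, applied to the label vectors above:

# sketch: V-measure as the harmonic mean of homogeneity and completeness
v_measure <- function(truth, pred, beta = 1) {
  tab <- table(truth, pred)
  n   <- sum(tab)
  entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
  # conditional entropy of the row variable given the column variable
  cond_entropy <- function(tab) {
    p_joint <- tab / sum(tab)
    p_cond  <- sweep(tab, 2, colSums(tab), "/")   # P(row | column)
    -sum(p_joint[tab > 0] * log(p_cond[tab > 0]))
  }
  H_C <- entropy(rowSums(tab) / n)                # entropy of the first labelling
  H_K <- entropy(colSums(tab) / n)                # entropy of the second labelling
  homogeneity  <- if (H_C == 0) 1 else 1 - cond_entropy(tab)    / H_C
  completeness <- if (H_K == 0) 1 else 1 - cond_entropy(t(tab)) / H_K
  (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
}

v_measure(ll1, cc1)   # perfectly aligned labels -> 1
v_measure(ll2, cc2)   # random labels -> close to 0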

Vass