I want to run hierarchical clustering with single linkage to cluster documents with 300 features and 1500 observations. I want to find the optimal number of clusters for this problem.

The page linked below uses the following code to find the number of clusters with the maximum gap statistic.

http://www.sthda.com/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning

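# Load the packages that provide clusGap() and hcut() / fviz_gap_stat()
library(cluster)
library(factoextra)

# Compute gap statistic
set.seed(123)

iris.scaled <- scale(iris[, -5])

gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)

# Plot gap statistic
fviz_gap_stat(gap_stat)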

However, the linked page does not clearly define hcut. How can I tell clusGap() to use single-linkage hierarchical clustering?

Is there an equivalent of clusGap() in Python?

Thanks

GeorgeOfTheRF

1 Answer

The hcut() function is part of the factoextra package used in the link you posted:

hcut package:factoextra R Documentation

Computes Hierarchical Clustering and Cut the Tree

Description:

 Computes hierarchical clustering (hclust, agnes, diana) and cut
 the tree into k clusters. It also accepts correlation based
 distance measure methods such as "pearson", "spearman" and
 "kendall".

R also has a built-in function, hclust(), which performs hierarchical clustering. However, it defaults to complete linkage (method = "complete"), takes a dissimilarity object rather than a data matrix, and does not return the list that clusGap() expects, so you can't simply replace hcut with hclust.

If you look at the help for clusGap(), however, you will see that you can provide a custom clustering function to be applied:

FUNcluster: a ‘function’ which accepts as first argument a (data) matrix like ‘x’, second argument, say k, k >= 2, the number of clusters desired, and returns a ‘list’ with a component named (or shortened to) ‘cluster’ which is a vector of length ‘n = nrow(x)’ of integers in ‘1:k’ determining the clustering or grouping of the ‘n’ observations.

The hclust() function performs single-linkage hierarchical clustering when given method = "single", so you can wrap it together with dist() and cutree():

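# Wrap hclust() to match the FUNcluster interface: takes (x, k), returns list(cluster = ...)
cluster_fun <- function(x, k) list(cluster = cutree(hclust(dist(x), method = "single"), k = k))
gap_stat <- clusGap(iris.scaled, FUN = cluster_fun, K.max = 10, B = 50)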
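As for Python: as far as I know, neither SciPy nor scikit-learn ships a direct counterpart to clusGap(), but the idea is short enough to sketch by hand. The following is only a rough illustration using the standard library, not a port of clusGap(). The function names (single_linkage_labels, log_within_dispersion, gap_statistic) are made up for this example. It assumes unsquared Euclidean distances (like clusGap's default d.power = 1), draws the reference data uniformly over each feature's observed range (similar in spirit to spaceH0 = "original"), and reports the plain gap values without the standard-error rule.

```python
import math
import random


def single_linkage_labels(points, k):
    """Cut a single-linkage tree into k clusters.

    Repeatedly merging the two closest clusters is equivalent to adding
    edges in order of increasing length (Kruskal) until k components remain.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    # Relabel the union-find roots as 0, 1, ..., k - 1
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def log_within_dispersion(points, labels):
    """log(W_k), where W_k sums, per cluster, pairwise distances / cluster size.

    Raises ValueError if every cluster is a single point (W_k == 0).
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        pair_sum = sum(
            math.dist(a, b)
            for idx, a in enumerate(members) for b in members[idx + 1:]
        )
        total += pair_sum / len(members)
    return math.log(total)


def gap_statistic(points, k_max, n_refs=20, seed=0):
    """Return [Gap(1), ..., Gap(k_max)] for single-linkage clustering."""
    rng = random.Random(seed)
    # Assumption: reference data is uniform over each feature's observed range
    bounds = [(min(col), max(col)) for col in zip(*points)]
    refs = [
        [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in points]
        for _ in range(n_refs)
    ]
    gaps = []
    for k in range(1, k_max + 1):
        observed = log_within_dispersion(points, single_linkage_labels(points, k))
        expected = sum(
            log_within_dispersion(ref, single_linkage_labels(ref, k)) for ref in refs
        ) / n_refs
        gaps.append(expected - observed)
    return gaps
```

Calling gap_statistic(points, k_max=10) on a list of coordinate tuples returns one gap value per k, and the index of the largest value plus one gives the "max gap" choice. This pure-Python version is quadratic in the number of observations and re-sorts the distances for every k, so for 1500 documents with 300 features you would probably want to swap the hand-rolled clustering for scipy.cluster.hierarchy.linkage(X, method="single") followed by fcluster(Z, t=k, criterion="maxclust").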
Keith Hughitt