
Is there a way in R to determine the number of clusters generated without manually specifying it?

After extracting the 'letters' from my string values, I clustered a variable with 30,000 distinct values to determine which values should be treated as the same, since some values are supposedly identical but differ in spacing, punctuation, etc. For instance,

Emilia Clarke
Emilia Clark e

should be categorized as one value.

I have produced a 30000 x 30000 matrix whose elements are the distances from one word to another.

library(stringr)
# Get the distinct characters of each string
extract_letters <- lapply(str_split(data01, ""), function(x) names(table(x)))
# Compute the pairwise distances; this produces a 30000 x 30000 matrix
compute_dist <- adist(extract_letters)
# Cluster
hc <- hclust(as.dist(compute_dist))
# Plot as a dendrogram
plot(hc)

Kindly see the resulting dendrogram.

The code below is what I use for smaller data, but it isn't applicable here, since I can't examine the plot with such a large number of inputs. The dendrogram is too messy for me to tell how many clusters are produced.

rect.hclust(hc, k = 7)

I have no idea how many clusters will be generated. I rely on the output of hclust itself, so I can't use cutree, since it requires me to specify the parameter k.

cutree(hc, k = 7)
Has QUIT--Anony-Mousse
icychamp
  • I assume you are doing this to use some model for inference or forecasting. Then the clustering can be considered part of the model and the number of clusters can be optimized based on (cross)validation. – Roland Oct 31 '16 at 08:49
  • @Roland, I am doing this to categorize values that are possibly the same. – icychamp Oct 31 '16 at 09:01
  • I understand that, but it's probably not the ultimate goal. Why do they need to be categorized? How do you check if categorization works well? – Roland Oct 31 '16 at 09:03
  • @Roland, to standardize the inputs for a certain variable. – icychamp Oct 31 '16 at 09:05
  • Please try to understand where I'm coming from. I could continue asking "and why do you do that" until we arrive at your actual goal, but I'll stop now. – Roland Oct 31 '16 at 09:07
  • Please look at the find_k function in the dendextend package. – Tal Galili Oct 31 '16 at 17:38

1 Answer


Many indices have been introduced to determine the number of clusters. The most common are the gap statistic, the Calinski-Harabasz (CH) index, the Davies-Bouldin (DB) index, and the silhouette index.
Most of these indices try to maximize the inter-cluster variation while minimizing the intra-cluster variation.
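
As a minimal sketch of how such an index can pick k for an hclust tree, the snippet below cuts the tree at several candidate values of k and keeps the one with the highest average silhouette width. The names in `x` are made-up toy data, and the choice of average linkage and of `cluster::silhouette` is an assumption for illustration, not something taken from the question. On 30,000 values this loop may be slow and memory-hungry, so a narrower range of k or a sample of the data may be needed.

```r
library(cluster)  # provides silhouette()

# Toy data (made up for illustration): three names, each with spelling variants
x <- c("Emilia Clarke", "Emilia Clark e", "Emilia  Clarke",
       "Kit Harington", "Kit Harrington", "Kit Haringto n",
       "Peter Dinklage", "Peter Dinklag e", "Peter  Dinklage")

d  <- as.dist(adist(x))               # pairwise edit distances
hc <- hclust(d, method = "average")   # linkage method is an assumption

# Average silhouette width for each candidate number of clusters
ks <- 2:(length(x) - 1)
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]      # k with the highest average silhouette
groups <- cutree(hc, k = best_k)      # cluster label for each value in x
```

`split(x, groups)` then lists the values that were grouped together, which makes it easy to check by eye whether the chosen k is sensible.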

In R, the NbClust package provides around 30 indices for determining the number of clusters for hierarchical and k-means clustering. You can read more in the NbClust package manual: https://cran.r-project.org/web/packages/NbClust/NbClust.pdf
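
A rough sketch of a typical NbClust call is shown below, assuming the package is installed. The numeric data are made up, and the choices of average linkage and the silhouette index are illustrative assumptions. NbClust works on a numeric data matrix (it also has a `diss` argument for a precomputed dissimilarity matrix), so string data like that in the question would need a suitable representation first.

```r
# Assumes the NbClust package is installed: install.packages("NbClust")
library(NbClust)

set.seed(1)
# Made-up numeric data: three well-separated 2-D groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2))

# Evaluate 2 to 6 clusters with a single index on a hierarchical clustering
res <- NbClust(data = x, distance = "euclidean",
               min.nc = 2, max.nc = 6,
               method = "average", index = "silhouette")

res$Best.nc                 # suggested number of clusters and the index value
table(res$Best.partition)   # cluster sizes under the suggested partition
```

Setting `index = "all"` instead runs the full set of indices and reports the number of clusters proposed by the majority, at the cost of a longer run time.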

Nipun Wijerathne