Is there a way in R to determine the number of clusters generated without manually specifying it?
After extracting the letters from my string values, I want to cluster a variable with 30000 distinct values so that I can determine which values should be treated as the same. Some values are supposedly the same but differ in spacing, punctuation, etc. For instance,
Emilia Clarke
Emilia Clark e
should be categorized as one value.
I have produced a 30000 x 30000 matrix whose elements are the distances between each pair of values.
#Get all letters from a string
> extract_letters <- lapply(str_split(data01,""),function(x) names(table(x)))
#Get the pairwise distances. This produces a 30000 x 30000 matrix
> compute_dist <- adist(extract_letters)
#Cluster
> hc <- hclust(as.dist(compute_dist))
#Plot as a dendrogram
> plot(hc)
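To make the steps above easier to check, here is a scaled-down sketch on three made-up values. It is not the exact code above: `toy`, `letter_sets` and `toy_dist` are hypothetical names, base R `strsplit` stands in for `str_split`, and spaces and punctuation are dropped before collecting the letters, which is an assumption about the intent.

```r
# Hypothetical toy input standing in for data01
toy <- c("Emilia Clarke", "Emilia Clark e", "Kit Harington")

# Drop everything that is not a letter (assumption: spacing and
# punctuation should not matter), then keep the distinct letters of
# each value as one sorted string.
letter_sets <- vapply(
  strsplit(gsub("[^[:alpha:]]", "", toy), ""),
  function(ch) paste(sort(unique(ch)), collapse = ""),
  character(1)
)

# Pairwise edit distances between the letter sets
toy_dist <- adist(letter_sets)
dimnames(toy_dist) <- list(toy, toy)
toy_dist
```

In this sketch the two spellings of "Emilia Clarke" end up with the same set of letters, so their distance is 0, while "Kit Harington" is at a positive distance from both.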
The code below is what I use for smaller data, but it is not applicable here because I cannot examine the plot with this many inputs. The dendrogram is too messy for me to tell how many clusters are produced.
> rect.hclust(hc,k=7)
I have no idea how many clusters will be generated. I rely on the output of hclust itself, so I cannot use cutree, since it requires me to specify the parameter k.
> cutree(hc, k = 7)
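For context, `cutree()` also accepts a height `h` in place of `k`, so one possible direction is to cut the tree at a distance threshold and count the resulting groups. The sketch below does this on made-up values with plain `adist()` on the raw strings. The names `values`, `d`, `hc_toy`, `groups` and `n_clusters` are hypothetical, and the threshold `h = 2` is only an assumption for illustration, not a recommended setting for the real data.

```r
# Hypothetical toy values standing in for the 30000 distinct values
values <- c("Emilia Clarke", "Emilia Clark e",
            "Kit Harington", "Kit Harington.",
            "Lena Headey")

# Pairwise edit distances and hierarchical clustering
# (hclust uses complete linkage by default)
d <- adist(values)
hc_toy <- hclust(as.dist(d))

# Cut by height instead of by number of clusters:
# merges made at a height of at most h stay together
groups <- cutree(hc_toy, h = 2)

# The number of clusters is then derived rather than specified
n_clusters <- length(unique(groups))
n_clusters

# Inspect which values were grouped together
split(values, groups)
```

Here the near-duplicate spellings are merged at height 1, so cutting at `h = 2` keeps each pair together and leaves the unrelated value on its own. Choosing `h` still requires a judgment about how different two values may be before they count as distinct.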