I am trying to use k-means clustering with the Levenshtein distance, and I am having a hard time interpreting the results.
# Courtesy: the code is borrowed from the other thread listed below, with k-means clustering added
set.seed(1)
rstr <- function(n, k) { # vector of n random char(k) strings
  sapply(1:n, function(i) {
    do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
  })
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
plot(hc)
# Normalize the distances to [0, 1] (useful when the sequences have unequal lengths)
data <- d / max(d)
# k-means with 3 clusters; each row of the distance matrix is treated as a feature vector
k.means.fit <- kmeans(data, 3)
library(cluster)
clusplot(d, k.means.fit$cluster, main = 'Clustering',
         color = TRUE, shade = TRUE,
         labels = 5, lines = 0, col.p = "dark green")
So, what does the cluster plot show, and how can I interpret it? I looked at other threads, which explain that the data are plotted on two principal components: https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r
But it is still not clear to me how to explain the figure, or why particular points fall inside a given ellipse/cluster. Any ideas? Thanks!
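
To make the question more concrete, here is a minimal sketch of how I understand the "two principal components" reading from the linked thread. It rebuilds the same toy data, projects the rows of the normalized distance matrix onto two principal components with `prcomp`, and cross-tabulates the k-means clusters against the known `aa`/`bb`/`cc` prefixes. The names `strs`, `fit`, `scores`, `explained` and `prefix` are mine, and using `prcomp` as a stand-in for whatever `clusplot` does internally is an assumption, so the picture may differ from the `clusplot` output in scaling or orientation.

```r
set.seed(1)

# Same toy data as in the question: 30 random strings with three known prefixes
rstr <- function(n, k) {
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}
strs <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

# Levenshtein distance matrix, normalized by its maximum
d <- adist(strs)
rownames(d) <- strs
data <- d / max(d)

# k-means on the rows of the distance matrix (each row is treated as a feature vector)
fit <- kmeans(data, centers = 3)

# Assumption: approximate the clusplot projection with a plain PCA of the same matrix
pca <- prcomp(data)
scores <- pca$x[, 1:2]
explained <- summary(pca)$importance["Proportion of Variance", 1:2]

# Scatter plot of the two components, coloured by k-means cluster
plot(scores, col = fit$cluster, pch = 19,
     xlab = "Component 1", ylab = "Component 2",
     main = "PCA of the distance matrix rows")
text(scores, labels = strs, pos = 3, cex = 0.6)

# How much of the variability the two plotted axes account for
print(explained)

# Do the clusters line up with the known prefixes?
prefix <- substr(strs, 1, 2)
print(table(prefix, cluster = fit$cluster))
```

If this reading is right, each point is one string, its position comes from its whole row of distances to the other strings, and the colour is the k-means assignment. What I still cannot explain is how the ellipses in the `clusplot` figure relate to this.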