
I am trying to run k-means clustering using the Levenshtein distance, and I am having a hard time interpreting the results.

    # Courtesy: code is borrowed from the other thread listed below, with k-means clustering added
    library(cluster)   # provides clusplot()

    set.seed(1)
    rstr <- function(n, k) {   # vector of n random char(k) strings
      sapply(1:n, function(i) {
        do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
      })
    }

    str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

    # Levenshtein distance
    d <- adist(str)
    rownames(d) <- str
    hc <- hclust(as.dist(d))
    plot(hc)

    # to normalize the distances when there are unequal length sequences
    d_max <- max(d)   # renamed so that base::max() is not shadowed
    data <- d / d_max

    k.means.fit <- kmeans(data, 3)
    clusplot(d, k.means.fit$cluster, main = "Clustering",
             color = TRUE, shade = TRUE,
             labels = 5, lines = 0, col.p = "dark green")

So, what does the cluster plot show, and how can I interpret it? I referred to other threads, which explain that the points are plotted on the first two principal components: https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r

But it is still not clear to me how to explain the figure, or why particular points fall inside a given ellipse/cluster. Any ideas? Thanks!

user3570187
  • This is probably a better fit for [stats.se], since it looks like you've got the code already, just need help interpreting – camille Aug 28 '19 at 23:48
  • Great comment, I posted it there. Also, is there a way to make the k-means plot more visually appealing and to plot only the few points/patterns that are relevant? – user3570187 Aug 28 '19 at 23:54
  • Visual appeal is subjective, and what's relevant depends on context, so I'm not sure about those. But there are several posts on CV that should probably help, especially in their pca and kmeans tags – camille Aug 29 '19 at 00:12
  • @camille please only recommend to *move* (migrate) questions, do not encourage posting duplicates. Thank you. – Has QUIT--Anony-Mousse Aug 29 '19 at 19:26
  • K-means expects a *continuous data matrix* as input and *only* uses squared Euclidean, *not* a distance matrix. While the result won't be obviously wrong, this approach does not make a lot of sense formally, as this creates all kinds of bias. Adding new objects changes the similarities of existing objects etc. - don't do this. Use KMeans only on appropriate data matrixes, not distance matrixes. – Has QUIT--Anony-Mousse Aug 29 '19 at 19:31
  • Your normalization is also flawed and doesn't do what you claim it does (adjust for different lengths)... – Has QUIT--Anony-Mousse Aug 29 '19 at 19:33
  • So what are you proposing to do for distance matrix? Which clustering? Selection of clusters? – user3570187 Aug 29 '19 at 19:35
  • @Anony-Mousse I didn't mean they should double post. I wasn't positive it should be moved, so I didn't vote to close – camille Aug 29 '19 at 20:30
  • @camille unless you ask to not duplicate, pointing a new user to another site usually leads to duplication there, unfortunately. – Has QUIT--Anony-Mousse Aug 30 '19 at 00:53
  • Sorry my mistake! I misinterpreted it! – user3570187 Aug 30 '19 at 00:55
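
One possible way to act on the comments above, offered as a hedged sketch and not as what the commenters themselves proposed: a method that accepts dissimilarities directly, such as PAM (k-medoids) from the cluster package, can be run on the Levenshtein matrix instead of k-means. Dividing each distance by the length of the longer string in the pair is an assumption about what "normalizing for unequal lengths" should mean here; `len`, `d_norm`, and `pam_fit` are illustrative names.

```r
library(cluster)

set.seed(1)
# Same toy data as in the question: 30 strings with prefixes aa, bb, cc
rstr <- function(n, k) {
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

# Pairwise Levenshtein distances
d <- adist(str)
dimnames(d) <- list(str, str)

# Assumed normalization: divide each pairwise distance by the length of the
# longer string in that pair, so every entry lies in [0, 1]
len <- nchar(str)
d_norm <- d / outer(len, len, pmax)

# PAM takes a dissimilarity object directly; no coordinate matrix is needed
pam_fit <- pam(as.dist(d_norm), k = 3)

# Cross-tabulate the two-letter prefix against the cluster labels
table(prefix = substr(str, 1, 2), cluster = pam_fit$clustering)
```

Hierarchical clustering, as already used in the question via `hclust(as.dist(d))`, is another method that works on the distance matrix as it is.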

1 Answer

This is pretty straightforward. You constructed your strings to fall into three groups: ten strings that start with 'aa', ten with 'bb' and ten with 'cc'. After those two letters, the rest of each string is random. Under Levenshtein distance, you would expect strings that share the same first two letters to be close to each other. In the plot of the hierarchical clustering it is easy to see three main groups defined by the first two letters of the strings. When you run kmeans with k=3, you get the same clusters. You can see this by checking the cluster assignments:

 k.means.fit$cluster
aagjo aaxfx aayrq aabfe aarju aamsz aajuy aafqd aagka aajwi bbmpm bbevr bbucs 
    1     1     1     1     1     1     1     1     1     1     3     3     3 
bbkvq bbuon bbuam bbtsm bbwlg bbbci bbnrk ccxhl cciqg ccmtc ccwiv ccjim ccxwk 
    3     3     3     3     3     3     3     2     2     2     2     2     2 
ccuyl ccski cctfs ccdgd 
    2     2     2     2 

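Cluster 1 contains the strings that start with 'aa', cluster 2 those that start with 'cc', and cluster 3 those that start with 'bb'. The cluster numbers themselves are arbitrary labels; only the grouping matters.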
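
To check that mapping without reading the labels by eye, the prefix can be cross-tabulated against the cluster labels. The second half of the sketch below is a rough illustration of what the cluster plot shows: according to its documentation, clusplot positions the points using principal components (or multidimensional scaling when given dissimilarities), so projecting the rows of the matrix onto the first two components gives a comparable picture. It uses `prcomp`, which may differ from clusplot's internals in scaling and sign. `nstart = 25`, `prefix`, and `pc` are illustrative choices that are not in the original post.

```r
set.seed(1)
# Same toy data as in the question
rstr <- function(n, k) {
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

d <- adist(str)
rownames(d) <- str

# k-means on the scaled distance matrix, as in the question;
# nstart = 25 (an added assumption) reduces dependence on the random start
k.means.fit <- kmeans(d / max(d), centers = 3, nstart = 25)

# Rows: two-letter prefix; columns: k-means cluster label
prefix <- substr(str, 1, 2)
table(prefix, cluster = k.means.fit$cluster)

# Project each row of the distance matrix onto the first two principal components
pc <- prcomp(d)$x[, 1:2]
plot(pc, col = k.means.fit$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
text(pc, labels = str, pos = 3, cex = 0.7)
```

If each row of the table has a single nonzero count, every prefix landed in its own cluster. Each ellipse in the cluster plot is drawn around the points of one cluster in this two-dimensional projection, so a point sits inside an ellipse because k-means assigned it to that cluster, not because of its position in the plot alone.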

G5W