I am using silhouette scores as a post-hoc measure of cluster validity for clusters derived from DBSCAN, but the metric does not reflect the cluster structure in one situation that occurs in my data, so I am looking for alternatives.

The issue occurs when two clusters are close together but clearly separated: the silhouette score rates them as poorly distinguished. A reproducible example is below. Is there an alternative cluster validity metric that handles this case better?
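For reference, this is my understanding of the standard silhouette definition, which I suspect explains the behaviour. The symbols are my own notation: $C_i$ is the cluster containing point $i$, and $d(i, j)$ is the distance between points $i$ and $j$.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j), \qquad
b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(i, j)
```

Because $a(i)$ and $b(i)$ are both mean distances, a point near the boundary between two adjacent clusters has $b(i)$ close to $a(i)$, so $s(i)$ is close to 0 however clean the gap between the clusters is.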

library(dplyr)
library(ggplot2)

set.seed(42)  # arbitrary seed so the example is reproducible
n = 1000
x = rnorm(n, mean = 0, sd = 0.5)
y = rnorm(n, mean = 0, sd = 0.5)


split = 0.05

df = data.frame(x = x, y = y)
df = df %>% 
  dplyr::filter(x < -1*split | x > split) %>% 
  mutate(
    group = ifelse(x < split, 1, 2)
  )

plot(df$x, df$y, col = factor(df$group))

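# dist() expects a matrix or data frame of coordinates; its second argument is the distance method
sil = cluster::silhouette(df$group, dist = dist(df[, c("x", "y")]))
# Subsetting the columns gives a plain matrix (cluster, neighbor, sil_width) that converts cleanly
silhouette_score = as.data.frame(sil[, 1:3])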

silhouette_score %>% 
  group_by(cluster) %>% 
  summarise(mean(sil_width))

ggplot(silhouette_score, aes(x = sil_width)) + 
  geom_histogram() + 
  facet_grid(~cluster)
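
To make "nearby but clearly separated" concrete, here is a minimal sketch that compares the smallest distance between the two clusters with the typical nearest-neighbour distance inside a cluster. This is only an illustration of the property I want a metric to capture, not an established validity index. The seed and the names `between_gap` and `nn_within` are my own choices for this example.

```r
# Illustrative only: regenerate the example data with an assumed seed
set.seed(1)
n <- 1000
split <- 0.05
df <- data.frame(x = rnorm(n, sd = 0.5), y = rnorm(n, sd = 0.5))
df <- df[abs(df$x) > split, ]          # remove the strip around x = 0
df$group <- ifelse(df$x < 0, 1, 2)

# Full pairwise Euclidean distance matrix on both coordinates
d <- as.matrix(dist(df[, c("x", "y")]))
in1 <- df$group == 1

# Smallest distance between any point in cluster 1 and any point in cluster 2
between_gap <- min(d[in1, !in1])

# Nearest-neighbour distance of each point within its own cluster
diag(d) <- Inf                          # ignore each point's zero distance to itself
nn_within <- c(
  apply(d[in1, in1], 1, min),
  apply(d[!in1, !in1], 1, min)
)

c(between_gap = between_gap, median_nn_within = median(nn_within))
```

If the gap between clusters is large relative to the within-cluster nearest-neighbour distance, the clusters are separated in the density sense that DBSCAN relies on, even when the mean-distance-based silhouette is low.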
SamPassmore
