I am using silhouette scores as a post-hoc measure of cluster validity for clusters found by DBSCAN, but the metric misrepresents one particular situation that occurs in my data, so I am looking for alternatives.
The issue occurs when there are two nearby but clearly separated clusters, which the silhouette score rates as poorly distinguished. I present an example below. Is there an alternative cluster validity metric that would handle this case better than silhouette scores do?
library(dplyr)
library(ggplot2)
set.seed(42)  # fixed seed so the example is reproducible
n = 1000
x = rnorm(n, mean = 0, sd = 0.5)
y = rnorm(n, mean = 0, sd = 0.5)
split = 0.05
df = data.frame(x = x, y = y)
# Remove a thin vertical strip around x = 0 to create two nearby, separated groups
df = df %>%
  dplyr::filter(x < -1*split | x > split) %>%
  mutate(
    group = ifelse(x < split, 1, 2)
  )
plot(df$x, df$y, col = factor(df$group))
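# dist() takes a single matrix or data frame, so pass both coordinates together;
# [, 1:3] drops the silhouette class so the result converts cleanly to a data frame
silhouette_score = cluster::silhouette(df$group, dist = dist(df[, c("x", "y")]))[, 1:3] %>%
  as.data.frame()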
silhouette_score %>%
  group_by(cluster) %>%
  summarise(mean(sil_width))
ggplot(silhouette_score, aes(x = sil_width)) +
  geom_histogram() +
  facet_grid(~cluster)
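
To make the behaviour concrete, here is a minimal sketch that computes the silhouette width by hand, using s(i) = (b - a) / max(a, b), for two points in group 1: the one nearest the gap and the one farthest from it. It rebuilds the same kind of data in base R only. The helper `manual_silhouette` and the seed are assumptions for illustration and are not part of the cluster package. I would expect the point next to the gap to score much closer to zero, because its mean distance to its own cluster is almost the same as its mean distance to the other cluster.

```r
# Hypothetical helper (not part of the cluster package): silhouette width of
# point i, computed directly from s(i) = (b - a) / max(a, b).
manual_silhouette <- function(i, d, group) {
  same <- group == group[i]
  same[i] <- FALSE                        # exclude the point itself from a(i)
  a <- mean(d[i, same])                   # mean distance to its own cluster
  other <- setdiff(unique(group), group[i])
  # mean distance to the nearest other cluster
  b <- min(sapply(other, function(g) mean(d[i, group == g])))
  (b - a) / max(a, b)
}

set.seed(1)                               # assumed seed, only for reproducibility
n <- 1000
split <- 0.05
df <- data.frame(x = rnorm(n, sd = 0.5), y = rnorm(n, sd = 0.5))
df <- df[abs(df$x) > split, ]             # same gap as in the example above
df$group <- ifelse(df$x < 0, 1, 2)

d <- as.matrix(dist(df[, c("x", "y")]))   # full Euclidean distance matrix

# Point in group 1 closest to the gap, and the point farthest from it
edge  <- which(df$x == max(df$x[df$group == 1]))
outer <- which.min(df$x)

s_edge  <- manual_silhouette(edge,  d, df$group)
s_outer <- manual_silhouette(outer, d, df$group)
print(c(edge = s_edge, outer = s_outer))
```

Because the silhouette compares average distances to whole clusters, the width of the gap barely affects points that sit right next to it. That is the effect I would like an alternative metric to avoid.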