I am building a clustering algorithm for data I have not yet seen, so I'm using some pseudo-data in the meantime. The isolation results from PAM show that I do not have any isolated clusters, but a ggplot of the t-SNE embedding shows well-formed clusters. I suspect this is due to my fake data. Does anyone have any thoughts as to why this would be?
Here is the data. Please note that Age and howOld represent different things:
library(dplyr)
library(cluster)
library(Rtsne)
library(ggplot2)
set.seed(1987)
n <- 350
clust_dat <-
  data.frame(personId = 1:n,
             networkPref = sample(c("topic", "jobtitle", "orgtype"),
                                  size = n, replace = TRUE,
                                  prob = c(0.56, 0.20, 0.24)),
             Age = sample(23:65, size = n, replace = TRUE),
             # these weights do not sum to 1; sample() normalizes them
             familyImp = sample(c(1, 2, 3, 4, 5), size = n, replace = TRUE,
                                prob = c(0.02, 0.01, 0.10, 0.4, 0.83)),
             howOld = sample(25:30, size = n, replace = TRUE,
                             prob = c(.40, .30, .20, .05, .03, .02)),
             horror = sample(c("Yes", "No"), size = n, replace = TRUE,
                             prob = c(0.27, 0.73)),
             sailBoat = sample(c("Yes", "No"), size = n, replace = TRUE,
                               prob = c(0.58, 0.42)),
             # daisy() needs factors, not character columns;
             # this was the data.frame() default before R 4.0
             stringsAsFactors = TRUE)
Here is my model build, after first defining the levels of my ordinal variable:
clust_dat$familyImp <- factor(clust_dat$familyImp,
                              levels = c("1", "2", "3", "4", "5"),
                              ordered = TRUE)
gower_dist <- daisy(clust_dat[, -1], metric = "gower")
gower_matrix <- as.matrix(gower_dist)
#find silhouette width for many PAM models
sil_width <- c(NA)
for (i in 2:ceiling(nrow(clust_dat) / 9)) {
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}
#build PAM model with best silhouette width
pam_fit <- pam(gower_dist, diss = TRUE, k = which.max(sil_width))
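To make the Gower step above concrete, here is a minimal sketch on a three-row toy data frame (the `toy` data and its column names are made up for illustration and are not part of my data). For each pair of rows, a nominal variable contributes 0 if the levels match and 1 if they differ, a numeric variable contributes the absolute difference divided by that variable's range, and the contributions are averaged:

```r
library(cluster)

# Hypothetical toy data: one nominal and one numeric variable
toy <- data.frame(
  pref = factor(c("topic", "topic", "jobtitle")),
  age  = c(23, 65, 23)
)

toy_gower <- as.matrix(daisy(toy, metric = "gower"))

# Rows 1 and 2: same pref (0), age differs by the full range (1)      -> (0 + 1) / 2
# Rows 1 and 3: different pref (1), same age (0)                      -> (1 + 0) / 2
# Rows 2 and 3: different pref (1), age differs by the full range (1) -> (1 + 1) / 2
print(round(toy_gower, 2))
```

So a single mismatch on a nominal variable moves two people as far apart as the largest possible age gap, which may influence how the clusters form on my data.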
When getting isolation info on the PAM, I get:
pam_fit$isolation
 1  2  3  4  5  6  7  8  9 10 11 12
no no no no no no no no no no no no
Levels: no L L*
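For reference, the `pam.object` documentation describes a cluster as an L*-cluster when its diameter is smaller than its separation, and both quantities are stored in `clusinfo`. A minimal sketch on toy data (the vector `x` and the names `toy_fit` and `info` are made up for illustration) shows how to compare them:

```r
library(cluster)

# Hypothetical toy data: two tight groups that lie far apart in one dimension
x <- c(0.0, 0.1, 0.2, 10.0, 10.1, 10.2)
toy_fit <- pam(dist(x), k = 2, diss = TRUE)

# clusinfo has one row per cluster, including its diameter and separation
info <- as.data.frame(toy_fit$clusinfo)
info$diam_lt_sep <- info$diameter < info$separation

print(info)
print(toy_fit$isolation)
```

Running the same comparison on `pam_fit$clusinfo` should show how far each of my 12 clusters is from meeting that condition.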
But plotting shows some well-formed clusters:
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <-
  tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = clust_dat$personId)
ggplot(tsne_data, aes(x = X, y = Y)) +
geom_point(aes(color = cluster))
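One check that might explain the picture: the three nominal variables have 3 × 2 × 2 = 12 level combinations, which is also the number of clusters in the isolation output above, so the groups in the t-SNE plot may simply be those combinations. Below is a sketch of that check on hypothetical toy data (`toy`, `a`, `b`, and `num` are made-up names), cross-tabulating the PAM clusters against the combinations of factor levels:

```r
library(cluster)

set.seed(1)
n <- 120

# Hypothetical toy data: two binary factors (4 combinations) plus one numeric column
toy <- data.frame(
  a   = factor(sample(c("x", "y"), n, replace = TRUE)),
  b   = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  num = runif(n)
)

# Cluster on Gower distances with k equal to the number of factor combinations
fit <- pam(daisy(toy, metric = "gower"), k = 4, diss = TRUE)

# If most of each row's count falls in a single column, the clusters
# are largely reproducing the factor combinations
combo <- interaction(toy$a, toy$b, drop = TRUE)
tab <- table(cluster = fit$clustering, combo = combo)
print(tab)
```

On my data the equivalent would be `table(pam_fit$clustering, interaction(clust_dat$networkPref, clust_dat$horror, clust_dat$sailBoat))`.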
Any ideas? If I remove all continuous variables, I get very poorly defined clusters, but some of them are recognized as isolated.