I am building a clustering algorithm for data I have not yet seen, so I'm using some pseudo data in the meantime. The results from PAM show that I do not have any isolated clusters, but the ggplot of a t-SNE embedding shows well-formed clusters. I suspect this is due to my fake data. Does anyone have any thoughts as to why this would be?

Here is the data. Note that Age and howOld represent different things:

library(dplyr)
library(cluster)
library(Rtsne)
library(ggplot2)

set.seed(1987)
n <- 350
# stringsAsFactors = TRUE: daisy() needs factors, and R >= 4.0 no longer
# converts character columns to factors automatically
clust_dat <- 
data.frame(personId = 1:n,
         networkPref = sample(c("topic", "jobtitle", "orgtype"),
                             size = n, replace = TRUE,
                             prob = c(0.56, 0.20, 0.24)),
         Age = sample(23:65, size = n, replace = TRUE),
         familyImp = sample(c(1, 2, 3, 4, 5), size = n, replace = TRUE, 
                            prob = c(0.02, 0.01, 0.10, 0.4, 0.83)),
         howOld = sample(25:30, size = n, replace = TRUE,
                         prob = c(.40, .30, .20, .05, .03, .02)),
         horror = sample(c("Yes", "No"), size = n, replace = TRUE, 
                         prob = c(0.27, 0.73)),
         sailBoat = sample(c("Yes", "No"), size = n, replace = TRUE, 
                           prob = c(0.58, 0.42)),
         stringsAsFactors = TRUE)

Here is my model build, after first defining the levels of my ordinal variable:

clust_dat$familyImp <- factor(clust_dat$familyImp, 
                          levels = c("1", "2", "3", "4", "5"), 
                          ordered = TRUE)

gower_dist <- daisy(clust_dat[, -1], metric = "gower")
gower_matrix <- as.matrix(gower_dist)

# find the average silhouette width for PAM models with k = 2, 3, ...
sil_width <- c(NA)
for (i in 2:ceiling(nrow(clust_dat) / 9)) {
  pam_fit <- pam(gower_dist, diss = TRUE, k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

#build PAM model with best silhouette width
pam_fit <- pam(gower_dist, diss = TRUE, k = which.max(sil_width))

When getting isolation info on the PAM, I get:

pam_fit$isolation

 1  2  3  4  5  6  7  8  9 10 11 12 
no no no no no no no no no no no no 
Levels: no L L*
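
For reference, the `cluster` package documentation describes a cluster as an L*-cluster when its diameter is smaller than its separation, and as "no" when neither that nor the weaker L condition holds. The sketch below uses toy one-dimensional data (an assumption for illustration, not the data above) to show how `clusinfo` and `isolation` relate:

```r
library(cluster)

# Toy 1-D data (hypothetical, not the data from the question):
# two tight groups of 10 points each, far apart from one another.
separated <- c(seq(0, 1, length.out = 10), seq(10, 11, length.out = 10))
fit_sep <- pam(dist(separated), k = 2, diss = TRUE)

# clusinfo reports, per cluster, the diameter (largest within-cluster
# dissimilarity) and the separation (smallest dissimilarity to any point
# outside the cluster).
fit_sep$clusinfo[, c("diameter", "separation")]
fit_sep$isolation

# Evenly spread points: any split into two clusters leaves them touching,
# so the diameter exceeds the separation and no cluster is isolated.
spread <- seq(0, 1, length.out = 20)
fit_spread <- pam(dist(spread), k = 2, diss = TRUE)
fit_spread$clusinfo[, c("diameter", "separation")]
fit_spread$isolation
```

Running the same `clusinfo` check on `pam_fit` shows how far each of the 12 clusters is from being isolated.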

But plotting shows some well-formed clusters:

tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- 
  tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = clust_dat$personId)

ggplot(tsne_data, aes(x = X, y = Y)) +
  geom_point(aes(color = cluster))

Any ideas? If I remove all continuous variables, I get poorly defined clusters, but some of them are recognized as isolated...

user111417

1 Answer

The way you generate the data, it should not have any clusters beyond what is an artifact of the categorical labels you used. Based on the frequencies you used, I would have expected 8 "clusters" corresponding to the trivial combinations of the attributes.

If you generate i.i.d. data, it is not supposed to cluster!

So I'd rather assume your visualization is the problem.

See, e.g., this answer on the problems of "seeing" clusters in tSNE.
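
As a rough illustration of the difference, here is a minimal sketch that compares i.i.d. data with data in which a hidden group label drives each variable. The `group` label, `age_centre`, `horror_prob`, and the `avg_sil` helper are all hypothetical choices made for this example, not part of the question's data:

```r
library(cluster)

set.seed(1987)
n <- 350

# i.i.d. version: every row is drawn from the same distribution.
iid_dat <- data.frame(
  Age    = sample(23:65, size = n, replace = TRUE),
  horror = factor(sample(c("Yes", "No"), size = n, replace = TRUE,
                         prob = c(0.27, 0.73)))
)

# Planted version: a hidden group label (hypothetical) is drawn first,
# and each variable's distribution depends on it.
group <- sample(1:3, size = n, replace = TRUE)
age_centre  <- c(27, 44, 61)        # assumed group-specific mean ages
horror_prob <- c(0.90, 0.50, 0.10)  # assumed group-specific P(horror = "Yes")
planted_dat <- data.frame(
  Age    = round(rnorm(n, mean = age_centre[group], sd = 2)),
  horror = factor(ifelse(runif(n) < horror_prob[group], "Yes", "No"),
                  levels = c("No", "Yes"))
)

# Average silhouette width of a k-cluster PAM fit on Gower dissimilarities.
avg_sil <- function(dat, k = 3) {
  pam(daisy(dat, metric = "gower"), k = k, diss = TRUE)$silinfo$avg.width
}

avg_sil(iid_dat)
avg_sil(planted_dat)
```

One would expect the planted version to score a noticeably higher average silhouette width, because there the rows really do come from different distributions depending on their group.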

Has QUIT--Anony-Mousse
  • Not too sure what you mean by all of this. How are my data i.i.d.? Some may be, but all? How do the frequencies I use lead to an expectation of 8 clusters? – user111417 Aug 20 '17 at 20:10
  • Your samples are independent of each other. There is no clustering in the data generation, where some object A would make similar objects more likely to be sampled. Your samples are A) independent of each other, and B) drawn from the identical distribution. – Has QUIT--Anony-Mousse Aug 20 '17 at 21:25
  • I understand that they are independent but not that they are identically distributed. Also, would you please elaborate on the expectation of 8 clusters? I think if we were to cluster only on the categoricals we should expect 12? – user111417 Aug 20 '17 at 22:14
  • They are all drawn from the same distribution, aren't they? The parameters of the `sample` calls are the same for all points. I believe you had a typo before, which caused one category to be too rare, reducing 12 to 8. – Has QUIT--Anony-Mousse Aug 21 '17 at 05:16
  • Ahh, yes, I see. You are right. I had a lapse of reason. How do you suggest changing the data to be non-i.i.d.? – user111417 Aug 21 '17 at 05:42