I am using R to perform hierarchical clustering on categorical data.
I am trying out different subsets of variables from my sample in order to identify the ones that give meaningful clustering results. However, I noticed that if I change the order of the columns in the data, the results are different. Is this due to the way hclust works, or am I missing something?
For each trial I extract a certain number of columns (columns 3, 28, 50 and 14 in the following example):
my.data.final <- data.frame(read.csv("C:\\Final dataset-for R.csv"))
library(dplyr)
my.data.final <- my.data.final %>% mutate_if(is.character,as.factor)
my.data.final <- my.data.final %>% mutate_if(is.integer,as.factor)
my.data.final$Age <- factor(my.data.final$Age, ordered = TRUE)
my.data3 <- my.data.final[,c(3,28,50,14)]
my.data3 <- na.exclude(my.data3, row.names=1)
complete.cases(my.data3)
library(cluster)
dist.gower <- daisy(my.data3, metric = "gower")
aggl.clust.c <- hclust(dist.gower, method = "complete")
plot(aggl.clust.c, main = "Agglomerative, complete linkages")
When I change the order of the columns in the line:
my.data3 <- my.data.final[,c(3,28,50,14)]
I noticed that the dendrogram changes. Is this expected with hclust?
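One way to narrow this down is to check whether the dissimilarities themselves change with the column order, or only the tree built from them. The following is a minimal sketch under stated assumptions: `toy` is a hypothetical data frame made up for illustration (it is not the real data set), and the object names are placeholders. It computes the Gower dissimilarities for two column orders, compares them up to floating-point tolerance, and counts tied values, since categorical data tend to produce many identical dissimilarities.

```r
library(cluster)

# Hypothetical toy data standing in for the real data set (an assumption).
set.seed(1)
toy <- data.frame(
  a = factor(sample(c("x", "y", "z"), 30, replace = TRUE)),
  b = factor(sample(c("low", "mid", "high"), 30, replace = TRUE),
             levels = c("low", "mid", "high"), ordered = TRUE),
  c = factor(sample(c("u", "v"), 30, replace = TRUE)),
  d = factor(sample(c("p", "q", "r", "s"), 30, replace = TRUE))
)

# Gower dissimilarities for two column orders of the same rows.
d1 <- daisy(toy[, c("a", "b", "c", "d")], metric = "gower")
d2 <- daisy(toy[, c("d", "c", "b", "a")], metric = "gower")

# Are the two dissimilarity matrices equal up to numerical tolerance?
same_dist <- isTRUE(all.equal(as.matrix(d1), as.matrix(d2),
                              check.attributes = FALSE))

# How many pairwise dissimilarities repeat an earlier value (ties)?
n_ties <- sum(duplicated(as.vector(as.matrix(d1)[lower.tri(as.matrix(d1))])))

# Build both trees and compare their merge heights.
h1 <- hclust(d1, method = "complete")
h2 <- hclust(d2, method = "complete")
same_heights <- isTRUE(all.equal(sort(h1$height), sort(h2$height)))

print(c(same_dist = same_dist, same_heights = same_heights))
print(n_ties)
```

If the dissimilarities agree but the dendrograms differ, the difference would have to come from how the merges are chosen among tied values rather than from the dissimilarity calculation itself.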
I have found that the line:
my.data.final$Age <- factor(my.data.final$Age, ordered = TRUE)
somehow affects the result, but I am not quite sure why.
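The effect of the ordered factor can be isolated with a small comparison. This is only a sketch on hypothetical toy values (the `Age` and `Sex` levels below are invented for illustration). It relies on the documented behaviour of daisy, which treats ordered factors as ordinal variables and unordered factors as nominal ones, so the same column can contribute different dissimilarities depending on how it is declared.

```r
library(cluster)

# Hypothetical toy values, not taken from the real data set.
age <- c("young", "middle", "old")
sex <- factor(c("f", "m", "f"))

# Same data, with Age declared once as nominal and once as ordinal.
nominal <- data.frame(Age = factor(age, levels = age), Sex = sex)
ordinal <- data.frame(Age = factor(age, levels = age, ordered = TRUE), Sex = sex)

d_nom <- as.matrix(daisy(nominal, metric = "gower"))
d_ord <- as.matrix(daisy(ordinal, metric = "gower"))

# Print both matrices side by side to see where they differ.
print(d_nom)
print(d_ord)

# TRUE if declaring Age as ordered changed the dissimilarities.
age_type_matters <- !isTRUE(all.equal(d_nom, d_ord))
print(age_type_matters)
```

Running the same comparison on the real columns, with and without the `ordered = TRUE` line, would show whether that line changes the dissimilarities before hclust is ever called.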