I am using R to perform hierarchical clustering on categorical data. I am trying out different variables from my sample to identify the ones that give meaningful clustering results. However, I noticed that if I change the order of the columns, the results are different. Is this due to the way hclust works, or am I missing something?

For each trial I extract a certain number of columns (columns 3, 28, 50, and 14 in the following example).

my.data.final <- read.csv("C:\\Final dataset-for R.csv")

library(dplyr)
# Convert character and integer columns to factors
my.data.final <- my.data.final %>% mutate_if(is.character, as.factor)
my.data.final <- my.data.final %>% mutate_if(is.integer, as.factor)
# Treat Age as an ordered factor
my.data.final$Age <- factor(my.data.final$Age, ordered = TRUE)

# Select the columns for this trial and drop rows with missing values
my.data3 <- my.data.final[, c(3, 28, 50, 14)]
my.data3 <- na.exclude(my.data3)
all(complete.cases(my.data3))  # should be TRUE after dropping NAs

library(cluster)
dist.gower <- daisy(my.data3, metric = "gower")
aggl.clust.c <- hclust(dist.gower, method = "complete")
plot(aggl.clust.c,
     main = "Agglomerative, complete linkages")

When I change the order of the columns in the line:

my.data3 <- my.data.final[,c(3,28,50,14)]

I noticed that the dendrogram changes. Is this expected with hclust? I have found that the line:

my.data.final$Age <- factor(my.data.final$Age, ordered = TRUE)

somehow affects the result, but I am not sure why.
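A minimal sketch of how the `ordered = TRUE` line can change the distances that `hclust` receives. It assumes the behaviour described in the `cluster` documentation: under `metric = "gower"`, `daisy` treats an unordered factor as nominal (0 if equal, 1 otherwise) and an ordered factor as ordinal (levels replaced by ranks and scaled to [0, 1]). The data frames `toy_nominal` and `toy_ordered` and their level names are made up for illustration and are not the original dataset.

```r
library(cluster)

# Hypothetical three-row example; level names are illustrative only
age_levels <- c("young", "middle", "old")
group      <- factor(c("a", "b", "a"))

# Age as a plain (nominal) factor
toy_nominal <- data.frame(
  Age   = factor(age_levels, levels = age_levels),
  Group = group
)

# Age as an ordered factor, as in the question's code
toy_ordered <- data.frame(
  Age   = factor(age_levels, levels = age_levels, ordered = TRUE),
  Group = group
)

d_nominal <- daisy(toy_nominal, metric = "gower")
d_ordered <- daisy(toy_ordered, metric = "gower")

# Nominal: "young" vs "middle" is as far apart as "young" vs "old".
# Ordinal: adjacent levels are closer than the two extremes.
print(as.matrix(d_nominal))
print(as.matrix(d_ordered))

# TRUE if the two encodings produce different dissimilarities
distances_differ <- !isTRUE(all.equal(as.vector(d_nominal),
                                      as.vector(d_ordered)))
```

If the two matrices differ, the ordered encoding is changing the input to the clustering, which would be one route by which that line affects the dendrogram.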

Anna
  • Could you please add the code you're using to do hierarchical clustering? Also provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). It's hard to help you without it. Also, did you set the random seed? And what exactly do you mean when you say `the results are different`? In what way? Please provide examples. – Arienrhod Sep 20 '19 at 11:06
  • I am trying to create a minimal example with different data (I am not able to share the original), but I don't seem to get the same problem. I will look into it. – Anna Sep 20 '19 at 11:26
  • This could indicate that the problem lies somehow within your data, but I'm not really sure how. – Arienrhod Sep 20 '19 at 11:30
  • It has something to do with the ordered data, but I still can't figure out what. – Anna Sep 20 '19 at 12:14
  • Most algorithms are just heuristic, so it isn't surprising that permuting the input data would give you a different heuristic solution (especially if it is an agglomerative clustering algorithm, where order plays a prominent role). Unless there is an actual bug in your code you are trying to fix, this seems more like a question for [statistics.se] rather than Stack Overflow. – John Coleman Sep 20 '19 at 13:49
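
The checks discussed in the comments above can be sketched roughly as follows. This is a diagnostic sketch only, run on made-up data (`toy` and its column names are hypothetical stand-ins for the unshared dataset). It tests two things: whether reordering the columns changes the Gower dissimilarities at all, and how many pairwise distances are tied. The working assumption, not confirmed anywhere in this thread, is that categorical data produces many tied distances, and that when ties exist the merge order in agglomerative clustering is not unique, so small differences in the input can lead to a different dendrogram.

```r
library(cluster)

# Hypothetical toy data standing in for the real dataset
set.seed(1)
n <- 30
toy <- data.frame(
  Age    = factor(sample(c("18-29", "30-49", "50+"), n, replace = TRUE),
                  levels = c("18-29", "30-49", "50+"), ordered = TRUE),
  Region = factor(sample(c("N", "S", "E", "W"), n, replace = TRUE)),
  Owner  = factor(sample(c("yes", "no"), n, replace = TRUE)),
  Group  = factor(sample(letters[1:3], n, replace = TRUE))
)

# Same rows, two different column orders
d1 <- daisy(toy[, c(1, 2, 3, 4)], metric = "gower")
d2 <- daisy(toy[, c(4, 3, 2, 1)], metric = "gower")

# Check 1: are the dissimilarities equal up to floating-point tolerance?
same_dist <- isTRUE(all.equal(as.vector(d1), as.vector(d2)))

# Check 2: how many distinct distance values are there among all pairs?
# Far fewer distinct values than pairs means many ties.
n_pairs    <- length(d1)
n_distinct <- length(unique(round(as.vector(d1), 10)))

# Check 3: compare the partitions obtained from the two dendrograms
h1 <- hclust(d1, method = "complete")
h2 <- hclust(d2, method = "complete")
agreement <- table(order1 = cutree(h1, k = 3), order2 = cutree(h2, k = 3))

print(same_dist)
print(c(pairs = n_pairs, distinct = n_distinct))
print(agreement)
```

On the real data, running the same three checks with `my.data3` in two column orders would show whether the distances themselves differ or only the tie-breaking does. The toy data may well give identical partitions even if the real data does not.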
