(Hierarchical) clustering detection with categorical variables in R using hclust

Question

I am trying to find a hierarchical pattern in my categorical data. Since I am not allowed to share the actual data, I created a similar problem that follows the same structure and question:

I want to investigate sleep and which factors are most important for it. I have a dataset of individuals whose sleep was measured, together with variables that may influence their sleep: mattress, last caffeine intake, diet, and noise level. What I want to know: in which (hierarchical) order do these factors influence sleep? I.e. which one is most important and which one least? And is it possible to detect such a hierarchical pattern using categorical variables?

Here is the code I have tried so far in my own dataset, but adjusted for this dummy dataset:

#make dummy df
ind <- c('ind_a', 'ind_b', 'ind_c', 'ind_d', 'ind_e', 'ind_f', 'ind_g', 'ind_h', 'ind_i', 'ind_j')
mattress <- c('soft', 'hard', 'medium', 'hard', 'very soft', 'medium', 'very soft', 'soft', 'soft', 'hard')
caff <- c('no caffeine', 'morning', 'early afternoon', 'early afternoon', 'evening', 'evening', 'morning', 'morning', 'no caffeine', 'no caffeine')
diet <- c('omni', 'omni', 'omni', 'pesc', 'pesc', 'veg', 'veg', 'omni', 'pesc', 'veg')
noise <- c('very loud', 'loud', 'little sound', 'little sound', 'quiet', 'quiet', 'little sound', 'loud', 'quiet', 'little sound')
df <- data.frame(ind = ind, mattress = mattress, caff = caff, diet = diet, noise = noise)
    
#compute distance between variables
library(reshape2)
df$all <- paste(df$mattress, df$caff, df$diet, df$noise, sep = ' ') #since each individual is a specific combination of factors
df <- cbind(df, df[,2:5])
names(df)[7:10] <- c('mattresstype', 'caffeineintake', 'diettype', 'noiselevel') #so that names are not the same; order matches mattress, caff, diet, noise
df <- melt(df, id.vars = c('mattresstype', 'diettype', 'caffeineintake', 'noiselevel', 'ind', 'all'))
df$know <- 1 # 1 here indicates that the individual shows this specific trait
mtrx <- acast(df,all ~ value, value.var= "know")
mtrx[is.na(mtrx)] <- 0 # 0 then indicates that the person does not show this trait
rownames(mtrx) <- 1:nrow(mtrx)
d <- dist(mtrx, method = 'binary', diag = TRUE) #binary (Jaccard) distance since the matrix is 0/1; named d to avoid masking dist()
hc <- hclust(d, method = 'ward.D') #Ward method clustering
plot(hc) #see the image below

This gives the following dendrogram, from which one can in principle detect a hierarchy:

Each number is thus one individual and its unique combination of 'sleep influencers'. From this dendrogram I hope to infer which specific influencers are most important, based on which individuals are grouped together and where they split.
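To make "which individuals are grouped together" concrete, here is a minimal sketch that cuts the tree into groups with `cutree()` and cross-tabulates group membership against each factor. The choice of `k = 3` and the helper `indicator()` are assumptions made only for illustration, and the 0/1 matrix is rebuilt in base R rather than with `melt`/`acast`, so it is not the exact pipeline above.

```r
# Dummy data with the same values as in the question
df <- data.frame(
  ind      = paste0("ind_", letters[1:10]),
  mattress = c("soft", "hard", "medium", "hard", "very soft",
               "medium", "very soft", "soft", "soft", "hard"),
  caff     = c("no caffeine", "morning", "early afternoon", "early afternoon",
               "evening", "evening", "morning", "morning",
               "no caffeine", "no caffeine"),
  diet     = c("omni", "omni", "omni", "pesc", "pesc",
               "veg", "veg", "omni", "pesc", "veg"),
  noise    = c("very loud", "loud", "little sound", "little sound", "quiet",
               "quiet", "little sound", "loud", "quiet", "little sound"),
  stringsAsFactors = FALSE
)
vars <- c("mattress", "caff", "diet", "noise")

# Hypothetical helper: one 0/1 column per level of a character vector
indicator <- function(x) {
  lv <- sort(unique(x))
  m <- outer(x, lv, "==") * 1
  colnames(m) <- lv
  m
}

# Individuals in rows, factor levels in columns
mtrx <- do.call(cbind, unname(lapply(df[vars], indicator)))
rownames(mtrx) <- df$ind

# Same distance and linkage as in the question
d  <- dist(mtrx, method = "binary")
hc <- hclust(d, method = "ward.D")

# Cut the tree into k groups (k = 3 is an arbitrary choice here)
groups <- cutree(hc, k = 3)

# For each factor, count how often each level occurs in each group
tabs <- lapply(df[vars], function(v) table(cluster = groups, level = v))
print(tabs)
```

Tables like these show which levels co-occur within groups of individuals. They do not by themselves rank the factors by their effect on sleep, because no sleep measurement enters the distance.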

However, I am not sure that this method can actually tell me about a possible hierarchy in the traits that affect sleep quality (in this case). Could anyone tell me whether this is a proper method and/or offer other suggestions? Or, if you have come across a similar post, could you share it?

Hopefully this was not too confusing and makes sense. I'm happy to answer any questions or provide more explanation or other code.

Many, many thanks and have a nice day :)

This question seems a bit out of scope for SO; maybe it fits better [here](https://stats.stackexchange.com/). However, a decision tree might work for your purposes. - s__, Aug 17 '21 at 13:27


0 Answers