
I have a data set consisting of the daily water intake for some mice belonging to 4 different genotypes. I am trying to write a script in order to classify these animals according to their pattern of water intake using a hierarchical cluster analysis and then create a longitudinal graph plotting the average water intake per cluster across days.

To do that, I first create the hierarchical cluster as follows:

library("dendextend")
library("ggplot2")
library("reshape2")

# Last column holds the genotype; the remaining columns are daily intake
data <- read.csv("data.csv", header = TRUE, row.names = 1)
trimmed <- data[, -ncol(data)]

hc <- as.dendrogram(hclust(dist(trimmed)))
# Factor, so genotype codes can index the pch/colour vectors
# (R >= 4.0 reads strings as character, not factor)
labels.drk <- factor(data[, ncol(data)])
groups.drk <- as.integer(labels.drk)[order.dendrogram(hc)]
# Use the factor levels so the legend order matches the codes above
genotypes <- levels(labels.drk)
k <- 4
cluster_cols <- rainbow(k)

hc <- hc %>%
  color_branches(k = k, col = cluster_cols) %>%
  set("branches_lwd", 1) %>%
  set("leaves_pch", rep(c(21, 19), length(genotypes))[groups.drk]) %>%
  set("leaves_col", palette()[groups.drk])

plot(hc, main = "Total animals", horiz = TRUE)

legend("topleft", legend = genotypes,
       col = palette(), pch = rep(c(21, 19), length(genotypes)),
       title = "Genotypes")

legend("bottomleft", legend = 1:k,
       col = cluster_cols, lty = 1, lwd = 2,
       title = "Drinking group")

Then I use the cutree function to find out which animals belong to which group, in order to plot the average water intake per group.

groups <- cutree(hc, k = k, order_clusters_as_data = FALSE)
x <- cbind(data, groups)
intake_avg <- aggregate(data[, -ncol(data)], list(x$groups), mean)

df <- melt(intake_avg, id.vars = "Group.1")
ggplot(df, aes(variable, value, group=factor(Group.1))) + geom_line(aes(color=factor(Group.1)))

The problem is a mismatch between the cluster numbers shown in the dendrogram and the numbers assigned by the cutree function. While the dendrogram numbers the branches bottom-up from 1 to 4, cutree uses some other ordering that I am not familiar with. Because of that, the labels in the cluster plot and in the intake graph don't match.

I am a beginner at coding, so I am surely using redundant lines and loops and my code could be shortened, but I would be very glad if you could help me figure out this specific issue.

Data set

Cluster: [image: Cluster]

Intake graph: [image: Intake graph]

1 Answer


To get the same clusters plotted in the dendrogram, you need to use:

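# Cluster ids numbered along the dendrogram, named by leaf label
groups <- dendextend::cutree(hc, k = k, order_clusters_as_data = FALSE)
# Re-align the ids to the row order of the data before binding
idx <- match(rownames(data), names(groups))
x <- cbind(data, groups = groups[idx])
intake_avg <- aggregate(data[, -ncol(data)], list(x$groups), mean)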

df <- melt(intake_avg, id.vars = "Group.1")
ggplot(df, aes(variable, value, group=factor(Group.1))) + 
 geom_line(aes(color=factor(Group.1)), lwd=1)

Here is the intake graph:

[image: intake graph]
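The re-alignment step can be checked on a small example. This is only a sketch under stated assumptions: the `toy` data frame and its values are invented for illustration and are not the data set from the question. It shows that the vector returned with `order_clusters_as_data = FALSE` is named by leaf label, so indexing it by `rownames()` (equivalent to the `match()` call above) puts the cluster ids back in the row order of the data.

```r
library(dendextend)

# Invented toy data: 6 animals x 3 days, three well-separated intake levels
toy <- data.frame(
  day1 = c(1.0, 1.2, 5.0, 5.1, 9.0, 9.3),
  day2 = c(1.1, 1.0, 5.2, 5.0, 9.1, 9.2),
  day3 = c(0.9, 1.3, 4.9, 5.3, 9.2, 9.0),
  row.names = paste0("m", 1:6)
)
dend <- as.dendrogram(hclust(dist(toy)))

# Named vector of cluster ids, one per leaf, named by leaf label
in_dend_order <- dendextend::cutree(dend, k = 3, order_clusters_as_data = FALSE)

# Index by label to bring the ids back into the row order of the data
groups <- in_dend_order[rownames(toy)]
x <- cbind(toy, groups = groups)

# Mean intake per day for each cluster
aggregate(toy, list(cluster = x$groups), mean)
```

Skipping the indexing step and binding `in_dend_order` directly would attach each id to whichever row happens to sit at the same position, which is the kind of mismatch described in the question.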

Marco Sandri
  • Hi Marco, thanks for the reply, but I am still getting the wrong groups. Now the groups are being classified based only on the order in which they appear in the original data set and no longer based on the dendrogram. Any clue? – Daniel da Silva Jan 02 '18 at 14:28
  • Hi Marco, I am using the same code that I posted previously, with the line you suggested added. Before, the classification was right but the assigned numbers were flipped; now the clusters created by cutree are different from the dendrogram. This is the code I am using now: https://pastebin.com/xYFxQrbb. – Daniel da Silva Jan 02 '18 at 17:14
  • Great, Marco! Thanks a lot. Very elegant approach. It was also useful for fixing another error I was getting in my code. Thank you very much – Daniel da Silva Jan 02 '18 at 18:47