How to add labels to original data given clustering result using hclust

Question

Say I have some unlabeled data that I know should be clustered into six categories, for example this dataset:

library(tidyverse)
ts <- read_table(url("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data"), col_names = FALSE)

If I create an hclust object with a sample of 60 from the original dataset like so:

n <- 10
s <- sample(1:100, n)
idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
ts.samp <- ts[idx,]
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
# compute DTW distances
library(dtw) # Dynamic Time Warping (DTW)
distMatrix <- dist(ts.samp, method = 'DTW')
# hierarchical clustering
hc <- hclust(distMatrix, method='average')

I know that I can then add the labels to the dendrogram for viewing like this:

observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
plot(hc, labels=observedLabels, main="")

However, I would like to add the correct labels to the initial data frame that was clustered. So for ts.samp I would like to add an extra column with the label of the cluster that each observation has been assigned to.

It would seem that ts.samp$cluster <- hc$labels should add the cluster to the data frame, however hc$labels returns NULL.

Can anyone help with extracting this information?

RLave · Answer 1 · 2018-10-17T09:46:51.517

You need to choose a level at which to cut the dendrogram; the cut is what forms the groups.

Use:

labels <- cutree(hc, k = 3) # set k to the number of clusters that fits your data; see the dendrogram discussion below
ts.samp$grouping <- labels
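
Since the question expects six categories, here is a minimal sketch with k = 6. It uses a toy data frame as a stand-in for ts.samp and plain Euclidean distance instead of DTW, so the names toy and hc_toy and the simulated data are assumptions for illustration only. cutree() returns one cluster id per row, in the same order as the rows that went into dist(), so it can be assigned directly as a column:

```r
# Toy stand-in for ts.samp (assumed data): 6 well-separated groups, 10 rows each
set.seed(1)
n <- 10
centers <- rep(seq(0, 50, by = 10), each = n)  # group means 0, 10, ..., 50
toy <- as.data.frame(matrix(rnorm(6 * n * 5, mean = centers, sd = 0.5), ncol = 5))
observedLabels <- rep(1:6, each = n)

# Cluster on the numeric columns (Euclidean distance here, not DTW)
hc_toy <- hclust(dist(toy), method = "average")

# One cluster id per row, in the original row order
toy$cluster <- cutree(hc_toy, k = 6)

# Cross-tabulate the known labels against the assigned clusters
print(table(observed = observedLabels, cluster = toy$cluster))
```

The cluster ids are arbitrary numbers, so on real data they will not necessarily line up with your own label numbering; the cross-table is the usual way to see which cluster corresponds to which category.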

Let's look at the dendrogram in order to find the best number for k:

plot(hc, main="")
abline(h=500, col = "red") # cut at height 500 forms 2 groups
abline(h=300, col = "blue") # cut at height 300 forms 3/4 groups

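It looks like either 2 or 3 might be good. You need to find the largest jump along the vertical axis (Height), that is, the longest stretch of height over which no clusters are merged.

Draw a horizontal line within that stretch and count the vertical branches it crosses; that is the number of clusters formed.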
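
The "largest jump" rule can also be checked numerically rather than by eye. The following is only a rough sketch on simulated data (toy, hc_toy, k_suggested and h_cut are hypothetical names, and Euclidean distance is used instead of DTW). It relies on hc$height holding the merge heights in increasing order, which is the case for average linkage:

```r
# Toy data (assumed): 3 well-separated groups of 20 points in 2 dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 20, sd = 0.5), ncol = 2))
hc_toy <- hclust(dist(toy), method = "average")

heights <- hc_toy$height          # one height per merge, increasing
gaps <- diff(heights)             # jump between consecutive merges
i <- which.max(gaps)              # largest jump lies between merge i and i + 1

# After merge i there are (number of observations - i) clusters left
k_suggested <- length(heights) + 1 - i

# Cut in the middle of the largest gap
h_cut <- mean(heights[c(i, i + 1)])
groups <- cutree(hc_toy, h = h_cut)
print(k_suggested)
print(table(groups))
```

This is a heuristic, not a definitive way to pick k: on noisy data the largest gap is often the final merge, which would suggest 2 clusters even when you know there should be more, so treat the result as a starting point next to the dendrogram.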

edited Oct 17 '18 at 09:46

answered Oct 17 '18 at 09:08

Thanks for responding but this doesn't work. `hc$labels` returns `NULL` and `ts.samp$cluster <- hc$labels` doesn't do anything. – pd441 Oct 17 '18 at 09:24
thats weird as there's no error for me. what's the error that you get? – pd441 Oct 17 '18 at 09:31
I'm sorry, I fixed my answer now. – RLave Oct 17 '18 at 09:47