
Say I have some unlabeled data that I know should be clustered into six categories, for example this dataset:

library(tidyverse)
ts <- read_table(url("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data"), col_names = FALSE)

If I create an hclust object from a sample of 60 rows of the original dataset, like so:

n <- 10
s <- sample(1:100, n)
idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
ts.samp <- ts[idx,]
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
# compute DTW distances
library(dtw) # Dynamic Time Warping (DTW)
distMatrix <- dist(ts.samp, method= 'DTW')
# hierarchical clustering
hc <- hclust(distMatrix, method='average')

I know that I can then add the labels to the dendrogram for viewing like this:

observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
plot(hc, labels=observedLabels, main="")

However, I would like to add the correct labels to the initial data frame that was clustered. So for ts.samp I would like to add an extra column with the label of the cluster that each observation has been assigned to.

It would seem that ts.samp$cluster <- hc$label should add the cluster to the data frame; however, hc$label returns NULL.

Can anyone help with extracting this information?

pd441

1 Answer


You need to define a level at which to cut your dendrogram; the cut is what forms the groups.

Use:

labels <- cutree(hc, k = 3) # set k to the most appropriate number of clusters; see how to read a dendrogram
ts.samp$grouping <- labels
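
Since the question states that six categories are expected, cutting with k = 6 and cross-tabulating against the known labels is one way to check the result. The following is a minimal sketch on a hypothetical toy data frame (`toy`, standing in for ts.samp) that uses plain Euclidean distance instead of DTW so that it runs without extra packages; the data, names, and seed are illustrative assumptions and not part of the original question.

```r
set.seed(1)
n <- 10
observed <- rep(1:6, each = n)              # known group of each row

# Hypothetical toy data: group g has mean 20 in column g and 0 elsewhere,
# plus standard normal noise (an assumption made for illustration only)
centers <- 20 * diag(6)[observed, ]
toy <- as.data.frame(centers + matrix(rnorm(6 * n * 6), ncol = 6))

# Same steps as in the question, with Euclidean distance instead of DTW
hc_toy <- hclust(dist(toy), method = "average")

# Cut the tree into six groups and store the assignment as a new column
toy$cluster <- cutree(hc_toy, k = 6)

# Rows: known labels, columns: cluster assigned by cutree()
print(table(observed, cluster = toy$cluster))
```

The cluster numbers returned by cutree() are arbitrary identifiers, so they should be compared with the known labels through a table like this one, not assumed to match one to one.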

Let's look at the dendrogram in order to find the best number for k:

plot(hc, main="")
abline(h=500, col = "red") # cutting at height 500 forms 2 groups
abline(h=300, col = "blue") # cutting at height 300 forms 3 or 4 groups


It looks like either 2 or 3 might be good. Look for the largest jump in the vertical lines (Height).

Draw a horizontal line at that height and count the clusters it forms, i.e. the number of vertical lines it crosses.
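
The "largest jump" reading of the dendrogram can also be approximated in code, using the merge heights stored in the hclust object. This is only a heuristic sketch on hypothetical toy data (`toy`, with Euclidean distance standing in for DTW); the variable names `k_est` and `cut_h` are made up for this example.

```r
set.seed(1)
n <- 10
observed <- rep(1:6, each = n)

# Hypothetical toy data with six well-separated groups (illustration only)
centers <- 20 * diag(6)[observed, ]
toy <- as.data.frame(centers + matrix(rnorm(6 * n * 6), ncol = 6))

hc_toy <- hclust(dist(toy), method = "average")

# hc_toy$height holds the height of each merge, in increasing order
heights <- hc_toy$height
gaps <- diff(heights)

# The largest jump lies between merge i and merge i + 1
i <- which.max(gaps)

# After merge i there are (number of observations - i) clusters left
k_est <- nrow(toy) - i

# Cut halfway up the largest jump
cut_h <- mean(heights[c(i, i + 1)])
groups <- cutree(hc_toy, h = cut_h)

print(k_est)
print(table(groups))
```

On real data the largest gap is often not clear-cut, so when the number of categories is already known, passing it directly as k is the simpler choice.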

RLave