
I have a set of points on a map, each with a given parameter value. I would like to:

  1. Cluster them spatially and ignore any clusters having fewer than 10 points. My df should have a column (Clust) for the cluster each point belongs to [DONE]
  2. Sub-cluster the parameter values within each cluster; add a column to my df (subClust) used to categorize each point by sub-cluster.

I don't know how to do the second part, except maybe with loops.

The image shows the spatially distributed points colour-coded by cluster (top left) and all points sorted by parameter value (top right). The bottom row shows only the clusters with >10 points (left) and one facet per cluster, sorted by parameter value (right). It's these facets that I'd like to colour-code by sub-cluster, according to a minimum cluster separation distance (d=1).

Any pointers/help appreciated. My reproducible code is below.

[Image: 2x2 grid of the four plots described above]

# TESTING
library(tidyverse)
library(gridExtra)

# Create a random (X, Y, Value) dataset
set.seed(36)
x_ex <- round(rnorm(200,50,20))
y_ex <- round(runif(200,0,85))
values <- rexp(200, 0.2)
df_ex <- data.frame(ID=1:length(y_ex),x=x_ex,y=y_ex,Test_Param=values)

# Cluster data by (X,Y) location
d = 4
chc <- hclust(dist(df_ex[,2:3]), method="single")

# Distance with a d threshold - used d=40 at one time but that changes...
chc.d40 <- cutree(chc, h=d) 
# max(chc.d40)

# Join results 
xy_df <- data.frame(df_ex, Clust=chc.d40)

# Plot results
breaks = max(chc.d40)
xy_df_filt <- xy_df %>% dplyr::group_by(Clust) %>% dplyr::mutate(n=n()) %>% dplyr::filter(n>10)# %>% nrow

p1 <- ggplot() +
  geom_point(data=xy_df, aes(x=x, y=y, colour = Clust)) +
  scale_color_gradientn(colours = rainbow(breaks)) +
  xlim(0,100) + ylim(0,100) 

p2 <- xy_df %>% dplyr::arrange(Test_Param) %>%
ggplot() +
  geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = Test_Param)) +
  scale_colour_gradient(low="red", high="green")

p3 <- ggplot() +
  geom_point(data=xy_df_filt, aes(x=x, y=y, colour = Clust)) +
  scale_color_gradientn(colours = rainbow(breaks)) +
  xlim(0,100) + ylim(0,100) 

p4 <- xy_df_filt %>% dplyr::arrange(Test_Param) %>%
ggplot() +
  geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = Test_Param)) +
  scale_colour_gradient(low="red", high="green") +
  facet_wrap(~Clust, scales="free")

grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)

THIS SNIPPET DOES NOT WORK - can't pipe within dplyr mutate() ...

# Second Hierarchical Clustering: Try to sub-cluster by Test_Param within the individual clusters I've already defined above
xy_df_filt %>% # This part does not work
  dplyr::group_by(Clust) %>% 
  dplyr::mutate(subClust = hclust(dist(.$Test_Param), method="single") %>% 
                  cutree(, h=1))

Below is a way around it using a loop - but I'd really rather learn how to do this using dplyr or some other non-loop method. An updated image showing the sub-clustered facets follows.

sub_df <- data.frame()
for (i in unique(xy_df_filt$Clust)) {
  temp_df <- xy_df_filt %>% dplyr::filter(Clust == i)
  # Sub-cluster by Test_Param within this spatial cluster
  a_d = 1
  a_chc <- hclust(dist(temp_df$Test_Param), method="single")

  # Cut the tree at height a_d (the sub-cluster separation distance)
  a_chc.d40 <- cutree(a_chc, h=a_d) 
  # max(chc.d40)

  # Join results to main df
  sub_df <- bind_rows(sub_df, data.frame(temp_df, subClust=a_chc.d40)) %>% dplyr::select(ID, subClust)
}
xy_df_filt_2 <- left_join(xy_df_filt,sub_df, by=c("ID"="ID"))

p4 <- xy_df_filt_2 %>% dplyr::arrange(Test_Param) %>%
ggplot() +
  geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = subClust)) +
  scale_colour_gradient(low="red", high="green") +
  facet_wrap(~Clust, scales="free")

grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)

[Image: updated 2x2 grid with the facets coloured by sub-cluster]

val

2 Answers


You could do this for your subclusters...

xy_df_filt_2 <- xy_df_filt %>% 
                group_by(Clust) %>% 
                mutate(subClust = tibble(Test_Param) %>% 
                                  dist() %>% 
                                  hclust(method="single") %>% 
                                  cutree(h=1))

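Nested pipes are fine. I think the problem with your version was that you were not passing the right sort of object to dist: inside a grouped mutate, .$Test_Param refers to that column of the whole data frame rather than to the current group's rows, whereas a bare Test_Param is evaluated once per group. The tibble term is not needed if you are only passing a single column to dist, but I have left it in case you want to use several columns, as you do for the main clustering.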
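To illustrate what gets passed to dist in each case, here is a minimal sketch on a small made-up data frame (toy and its values are hypothetical, not taken from the question). It compares the length of `.$Test_Param` with that of the bare column inside a grouped mutate, then runs the per-group clustering without the tibble wrapper.

```r
library(dplyr)

# Hypothetical toy data: two spatial clusters of different sizes
toy <- tibble(
  Clust      = c(1, 1, 1, 2, 2),
  Test_Param = c(0.1, 0.5, 5.0, 2.0, 2.3)
)

# `.` is the whole piped-in data frame; the bare column is the group's slice
toy_len <- toy %>%
  group_by(Clust) %>%
  mutate(n_dot = length(.$Test_Param),
         n_col = length(Test_Param)) %>%
  ungroup()

# Per-group single-linkage clustering on the bare column, cut at h = 1
toy_sub <- toy %>%
  group_by(Clust) %>%
  mutate(subClust = cutree(hclust(dist(Test_Param), method = "single"), h = 1)) %>%
  ungroup()
```

In toy_len, n_dot is the full row count in every row, while n_col is the size of each group. This is why the result of the original `.$Test_Param` version does not line up with the groups.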

You could use the same sort of formula, but without the group_by, to calculate xy_df from df_ex.
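As a rough sketch of that suggestion, assuming the same df_ex construction and d = 4 threshold as in the question, the spatial clustering could be written as a single mutate with no group_by:

```r
library(dplyr)

# Rebuild the example data as in the question
set.seed(36)
x_ex <- round(rnorm(200, 50, 20))
y_ex <- round(runif(200, 0, 85))
values <- rexp(200, 0.2)
df_ex <- data.frame(ID = 1:length(y_ex), x = x_ex, y = y_ex, Test_Param = values)

# Spatial clustering on (x, y) in one ungrouped mutate, cut at h = 4
xy_df <- df_ex %>%
  mutate(Clust = tibble(x, y) %>%
           dist() %>%
           hclust(method = "single") %>%
           cutree(h = 4))
```

Here the tibble(x, y) wrapper is needed, because dist has to see both coordinates as columns of one object.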

Andrew Gustar
  • your tibble(x,y) should instead read as tibble(Test_Param) to be right, since the second clustering is based on Test_Param distances and not x,y. But your method works. Thx – val Apr 15 '18 at 06:43
  • Yes, of course - sorry about that. I've amended the answer. – Andrew Gustar Apr 15 '18 at 06:49
  • I get a bunch of warnings (Warning in mutate_impl(.data, dots) : binding character and factor vector, coercing into character vector) when I run this code and seems related to this issue (https://github.com/tidyverse/dplyr/issues/2911) but I couldn't resolve it; I wanted to convert subClust to factor using factor() or as.factor() and I wonder if tibble() is getting in the way. The answer by Camille doesn't have this issue. – val May 11 '18 at 00:30
  • @val Yes, that is (I think) just an indication that the `mutate` is having to add factor levels, which it does by converting to character. It is just a warning - I've had it with other things but it does not necessarily mean that the calculation doesn't work. – Andrew Gustar May 11 '18 at 08:08

There should be a way to do it using a combination of do and tidy, but I always have a hard time getting things to line up the way I want using do. Instead, what I usually do is combine split from base R and map_dfr from purrr. split will split the dataframe by Clust and give you a list of dataframes that you can then map over. map_dfr maps over each of those dataframes and returns a single dataframe.

I started from your xy_df_filt and generated what I believe should be the same as the xy_df_filt_2 that you got from the for loop. I made two plots, although the two sets of clusters are a little hard to see.

xy_df_filt_2 <- xy_df_filt %>%
    split(.$Clust) %>%
    map_dfr(function(df) {
        subClust <- hclust(dist(df$Test_Param), method = "single") %>% cutree(., h = 1)

        bind_cols(df, subClust = subClust)
    })

ggplot(xy_df_filt_2, aes(x = x, y = y, color = as.factor(subClust), shape = as.factor(Clust))) +
    geom_point() +
    scale_color_brewer(palette = "Set2")

Clearer with faceting

ggplot(xy_df_filt_2, aes(x = x, y = y, color = as.factor(subClust), shape = as.factor(Clust))) +
    geom_point() +
    scale_color_brewer(palette = "Set2") +
    facet_wrap(~ Clust)

Created on 2018-04-14 by the reprex package (v0.2.0).
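To make the intermediate step of the split / map_dfr approach easier to see, here is a small sketch on made-up data (toy is hypothetical, not from the question). It shows that split returns a named list with one data frame per Clust value, and that map_dfr binds the per-piece results back into a single data frame.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: two spatial clusters
toy <- tibble(
  Clust      = c(1, 1, 1, 2, 2),
  Test_Param = c(0.1, 0.5, 5.0, 2.0, 2.3)
)

# One data frame per cluster, named by the Clust value
pieces <- split(toy, toy$Clust)

# Sub-cluster each piece separately, then row-bind the results
toy_sub <- map_dfr(pieces, function(df) {
  df$subClust <- cutree(hclust(dist(df$Test_Param), method = "single"), h = 1)
  df
})
```

Because each piece is clustered on its own, the subClust labels restart at 1 within every Clust.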

camille
  • I think this is a nice answer too - using tools that I was not familiar with. Thank you. – val Apr 15 '18 at 06:20
  • see my comment in the answer by Andrew; your method generates no warnings whereas his does. – val May 11 '18 at 00:31