
I am trying to assign points to groups based on Euclidean distance. For example, in the data below three points represent three different groups (One, Two, and Three; the non-green points in the figure). I would like to assign each of the remaining points (Scatter; the green points) to a group based on the minimum Euclidean distance, i.e. change Scatter to whichever of the One, Two, or Three points is closest.

I was trying to do this without kmeans or another clustering function, simply using the minimum Euclidean distance, but I welcome and appreciate any suggestions.

library(ggplot2)

set.seed(123)
Data <- data.frame(
  x = c(c(3,5,8), runif(20, 1, 10)),
  y = c(c(3,5,8), runif(20, 1, 10)),
  Group = c(c("One", "Two", "Three"), rep("Scatter", 20))
)

ggplot(Data, aes(x, y, color = Group)) +
  geom_point(size = 3) +
  theme_bw()

[Figure: scatter plot of Data, coloured by Group]

B. Davis

1 Answer


What about something like this:

library(tidyverse)

bind_cols(
    Data,
    dist(Data %>% select(-Group)) %>%              # Get x/y coordinates from Data
        as.matrix() %>%                            # Convert to full matrix
        as.data.frame() %>%                        # Convert to data.frame
        select(1:3) %>%                            # We're only interested in dist to 1,2,3
        rowid_to_column("pt") %>%                  # Keep the point index as column pt
        gather(k, v, -pt) %>%                      # Long format: one row per (pt, k) pair
        group_by(pt) %>%                           # One group per point
        summarise(k = k[which.min(v)])) %>%        # Select label with min dist
    mutate(Group = factor(Group, levels = unique(Data$Group))) %>%
    ggplot(aes(x, y, colour = k, shape = Group)) +
    geom_point(size = 3)

[Figure: the same points, coloured by nearest reference point k, with shape showing the original Group]

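Explanation: dist computes all pairwise Euclidean distances between the points; keeping only the first three columns of the resulting matrix leaves the distance from every point to One, Two and Three. We then assign every point a label k according to the closest of the three: One (k = 1), Two (k = 2) or Three (k = 3). The three reference points are at distance 0 from themselves, so each keeps its own label.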
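If a data frame is wanted rather than a plot, the same idea can be sketched in base R. This is only a minimal sketch, not the pipeline above: the names refs, scatter, d, Assigned and the column Nearest are made up for illustration, and it assumes the Scatter rows are the only ones that need relabelling.

```r
# Rebuild the example data from the question
set.seed(123)
Data <- data.frame(
  x = c(c(3, 5, 8), runif(20, 1, 10)),
  y = c(c(3, 5, 8), runif(20, 1, 10)),
  Group = c(c("One", "Two", "Three"), rep("Scatter", 20))
)

# Split into reference points and points to classify
refs    <- Data[Data$Group != "Scatter", ]
scatter <- Data[Data$Group == "Scatter", ]

# Distance matrix: one row per Scatter point, one column per reference point
d <- sqrt(outer(scatter$x, refs$x, "-")^2 + outer(scatter$y, refs$y, "-")^2)

# For each row take the column with the smallest distance
# and map it back to the reference point's group name
Assigned <- Data
Assigned$Nearest <- as.character(Assigned$Group)
Assigned$Nearest[Assigned$Group == "Scatter"] <-
  as.character(refs$Group)[apply(d, 1, which.min)]

head(Assigned)
```

Assigned keeps the original Group column and adds Nearest, so the original labels are not lost; overwrite Group instead if that is what you need.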

Note that the Scatter point at (9.6, 3.1) is indeed correctly "classified" as belonging to Two (k = 2), even though it may look closer to Three when the two axes are not drawn at the same scale; you can confirm this by adding coord_fixed() to the ggplot chain.
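The same check can be done numerically. The sketch below uses the rounded coordinates (9.6, 3.1) read off the plot, so the distances are approximate; p, refs and d are illustrative names only.

```r
# Rounded coordinates of the Scatter point in question (approximate)
p <- c(9.6, 3.1)

# The three reference points from the question
refs <- rbind(One = c(3, 3), Two = c(5, 5), Three = c(8, 8))

# Euclidean distance from p to each reference point:
# sweep() subtracts p from every row, rowSums() keeps the row names
d <- sqrt(rowSums(sweep(refs, 2, p)^2))

round(d, 2)
names(which.min(d))  # "Two"
```

The distances to Two and Three differ only slightly, which is why the plot can be misleading without a fixed aspect ratio.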

Maurits Evers
  • Very helpful, and what is plotted looks perfect, but I am not sure how (or where) to modify your code to get a new data frame with the desired result. – B. Davis Sep 17 '18 at 03:01
  • @B.Davis That should be fairly straightforward. `Data` is your original `data.frame`. All I do is column-bind `Data` and the cluster information from `dist` into a new `data.frame` which is then passed on to `ggplot`. – Maurits Evers Sep 17 '18 at 03:08
  • @B.Davis PS. I've added some comments in the code that hopefully help. – Maurits Evers Sep 17 '18 at 03:11