R, compute the smallest Euclidean Distance for two dataset, and label it automatically

Question

I'm working with Euclidean Distance with a pair of dataset. First of all, my data.

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

My goal is to find, for each point in points, the smallest distance from all the center in centers, and append the center name to the points dataset (clearly the smallest one's), and make this procedure automatic.

So I started with the base:

#Euclidean distance
sqrt(sum((x-y)^2))

The fact that I have in my mind how it should work, but I cannot manage how to make it automatic.

choose one row of points, and all the rows of centers
calculate the Euclidean Distance between the row and each row of centers
choose the smallest distance
attach the label of the smallest distance
repeat for the second row ... till the end of points

So I managed to do it manually, to have all the steps to make it automatic:

# 1.  
x = (points[1,2:3])   # select the first of points
y1 = (centers[1,1:2]) # select the first center
y2 = (centers[2,1:2]) # select the second center
y3 = (centers[3,1:2]) # select the third center
y4 = (centers[4,1:2]) # select the fourth center

# 2.
# then the distances
distances <- data.frame(distance = c(
                                    sqrt(sum((x-y1)^2)),
                                    sqrt(sum((x-y2)^2)),
                                    sqrt(sum((x-y3)^2)),
                                    sqrt(sum((x-y4)^2))),
                                    center = centers$center
                                    )

# 3.
# then I choose the row with the smallest distance
d <- distances[which(distances$distance==min(distances$distance)),]

# 4.
# last, I put the label near the point
cbind(points[1,],d)

# 5. 
# then I restart for the second point

The problem is that I cannot manage it automatically. have you got any idea to make this procedure automatic for each points of points? Furthermore, am I reinventing the wheel, i.e. does it exist a faster procedure (as a function) that I don't know?

score 2 · Accepted Answer · answered May 23 '18 at 11:29

2

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

library(tidyverse)

points %>%
  mutate(c = list(centers)) %>%
  unnest() %>%                       # create all posible combinations of points and centers as a dataframe
  rowwise() %>%                      # for each combination
  mutate(d = sqrt(sum((c(x_p,y_p)-c(x_ce,y_ce))^2))) %>%   # calculate distance
  ungroup() %>%                                            # forget the grouping
  group_by(point, x_p, y_p) %>%                            # for each point
  summarise(closest_center = center[d == min(d)]) %>%      # keep the closest center
  ungroup()                                                # forget the grouping

# # A tibble: 4 x 4
#   point   x_p   y_p closest_center
#   <fct> <dbl> <dbl> <fct>         
# 1 p1      160     7 b             
# 2 p2      600    23 d             
# 3 p3      400    56 c             
# 4 p4      245    12 a

answered May 23 '18 at 11:29

AntoniosK

15,991
2
19
32

Awesome! I've added `%>% data.frame()` in the end and `x <- ` in the beginning, to store it in a dataframe (not in the question, but I put this for the future readers). Thanks a lot. – s__ May 23 '18 at 13:12
Any idea about the fact that if I use a larger dataset I have this error: `Error in summarise_impl(.data, dots) :Expecting a single value: [extent=2].`? – s__ May 23 '18 at 13:23
No, I think that the problem is that sometimes there are two "d" in each point with the same value, so maybe it has some problem to choose the `min`, but I'm not sure. Surely everything works fine till the `summarize()` function. Any idea? – s__ May 23 '18 at 15:01
1

If that's the case we can fix it by using `summarise(closest_center = first(center[d == min(d)]))` to keep the first centre, or `summarise(closest_center = paste0(center[d == min(d)], collapse = "_"))` to keep both as a string. – AntoniosK May 23 '18 at 15:14
1

Great! I suppose I told you the wrong error (my bad, sorry): in the next answer it's pointed out that the problem is that there are some duplicate row. However you've anserwed perfectly to everything, thanks. – s__ May 24 '18 at 07:23

WaltS · Answer 2 · 2018-05-24T18:52:26.193

With the dplyr package, you can use group_by to loop over each point and mutate to form a list of distances, set distance as the min of the list, and set center as the name of the minimum distance center. I've included two alternatives for the cases of duplicate rows or point names.

    library(dplyr)
   centers <- data.frame(x_ce = c(300,180,450,500),
                        y_ce = c(23,15,10,20),
                        center = c('a','b','c','d'))
   points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                       x_p = c(160,600,400,245, 245),
                       y_p = c(7,23,56,12, 12))
#
#  If duplicate rows need to be removed
#
  result1 <- points %>% group_by(point) %>%  distinct() %>% 
                                  mutate(lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ), 
                                  distance=min(unlist(lst)), 
                                  center = centers$center[which.min(unlist(lst))]) %>%
             select(-lst)

which gives the result

# A tibble: 4 x 5
# Groups:   point [4]
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a

and

#
# Alternative if point names are not unique
#
  points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                       x_p = c(160,600,400,245, 550),
                       y_p = c(7,23,56,12, 25))
  result2 <- points %>% rowwise() %>%
                    mutate( lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ), 
                               distance=min(unlist(lst)), 
                              center = centers$center[which.min(unlist(lst))]) %>%
                    ungroup() %>% select(-lst)

with the result

# A tibble: 5 x 5
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a     
5 p4      550    25     50.2 d

Any idea that if I use a larger dataset I have this error `Error in summarise_impl(.data, dots) : Column `x_p` must be length 1 (a summary value), not 4`? — s__, May 23 '18 at 13:25
It looks like you have rows with duplicate point names in points. if you're sure that these rows are complete duplicates, you can remove them with distinct as I've shown above. You should examine points to verify that this is the problem. — WaltS, May 23 '18 at 15:49

R, compute the smallest Euclidean Distance for two dataset, and label it automatically

2 Answers2