
I'm working with Euclidean distance on a pair of datasets. First of all, my data:

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

My goal is to find, for each point in points, the nearest center in centers (the one at the smallest distance), append that center's name to the points dataset, and automate the whole procedure.

So I started with the base:

#Euclidean distance
sqrt(sum((x-y)^2))

I have in mind how it should work, but I cannot work out how to automate it:

  1. choose one row of points, and all the rows of centers
  2. calculate the Euclidean Distance between the row and each row of centers
  3. choose the smallest distance
  4. attach the label of the smallest distance
  5. repeat for the second row ... till the end of points
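
For reference, the five steps above could be sketched in base R roughly as follows. This is only a minimal sketch on the toy data from this question: `nearest_center` is a hypothetical helper name introduced here, and `which.min()` silently keeps the first center when two distances tie.

```r
# Same toy data as in the question
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c('a', 'b', 'c', 'd'))
points <- data.frame(point = c('p1', 'p2', 'p3', 'p4'),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

# Steps 1-4 for a single point (nearest_center is a hypothetical helper)
nearest_center <- function(x, y, centers) {
  d <- sqrt((x - centers$x_ce)^2 + (y - centers$y_ce)^2)  # step 2: distance to every center
  i <- which.min(d)                                        # step 3: index of the smallest (first on ties)
  data.frame(center = centers$center[i], distance = d[i])  # step 4: label plus distance
}

# Step 5: repeat for every row of points and bind the results
nearest <- do.call(rbind, lapply(seq_len(nrow(points)), function(i) {
  nearest_center(points$x_p[i], points$y_p[i], centers)
}))
result <- cbind(points, nearest)
```

Here `result` holds the original columns of points plus `center` and `distance`, one row per point.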

I managed to do it manually for one point, to lay out all the steps that need to be automated:

# 1.  
x = (points[1,2:3])   # select the first of points
y1 = (centers[1,1:2]) # select the first center
y2 = (centers[2,1:2]) # select the second center
y3 = (centers[3,1:2]) # select the third center
y4 = (centers[4,1:2]) # select the fourth center

# 2.
# then the distances
distances <- data.frame(distance = c(
                                    sqrt(sum((x-y1)^2)),
                                    sqrt(sum((x-y2)^2)),
                                    sqrt(sum((x-y3)^2)),
                                    sqrt(sum((x-y4)^2))),
                                    center = centers$center
                                    )

# 3.
# then I choose the row with the smallest distance
d <- distances[which(distances$distance==min(distances$distance)),]

# 4.
# last, I put the label near the point
cbind(points[1,],d)

# 5. 
# then I restart for the second point

The problem is that I cannot automate it. Do you have any idea how to run this procedure automatically for each point in points? Also, am I reinventing the wheel, i.e. is there a faster existing function that I don't know about?

s__

2 Answers

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

library(tidyverse)

points %>%
  mutate(c = list(centers)) %>%
  unnest(c) %>%                      # create all possible combinations of points and centers as a data frame
  rowwise() %>%                      # for each combination
  mutate(d = sqrt(sum((c(x_p,y_p)-c(x_ce,y_ce))^2))) %>%   # calculate distance
  ungroup() %>%                                            # forget the grouping
  group_by(point, x_p, y_p) %>%                            # for each point
  summarise(closest_center = center[d == min(d)]) %>%      # keep the closest center
  ungroup()                                                # forget the grouping

# # A tibble: 4 x 4
#   point   x_p   y_p closest_center
#   <fct> <dbl> <dbl> <fct>         
# 1 p1      160     7 b             
# 2 p2      600    23 d             
# 3 p3      400    56 c             
# 4 p4      245    12 a
AntoniosK
  • Awesome! I've added `%>% data.frame()` at the end and `x <- ` at the beginning to store the result in a data frame (not asked in the question, but noted here for future readers). Thanks a lot. – s__ May 23 '18 at 13:12
  • Any idea why, with a larger dataset, I get this error: `Error in summarise_impl(.data, dots) :Expecting a single value: [extent=2].`? – s__ May 23 '18 at 13:23
  • No, I think the problem is that some points have two "d" values that are equal, so it may be unable to choose the `min`, but I'm not sure. Everything certainly works fine up to the `summarize()` function. Any idea? – s__ May 23 '18 at 15:01
  • If that's the case, we can fix it by using `summarise(closest_center = first(center[d == min(d)]))` to keep the first centre, or `summarise(closest_center = paste0(center[d == min(d)], collapse = "_"))` to keep both as a string. – AntoniosK May 23 '18 at 15:14
  • Great! I suppose I reported the wrong error (my bad, sorry): the next answer points out that the problem is some duplicate rows. However, you've answered everything perfectly, thanks. – s__ May 24 '18 at 07:23
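
To illustrate the two `summarise()` variants suggested in the comments above, here is a small sketch. The `dists` data frame is hypothetical and built only to contain a tie: p1 is equally far from centers a and b.

```r
library(dplyr)

# Hypothetical toy input: p1 is at distance 5 from both a and b
dists <- data.frame(point  = c('p1', 'p1', 'p2', 'p2'),
                    center = c('a', 'b', 'a', 'b'),
                    d      = c(5, 5, 3, 7),
                    stringsAsFactors = FALSE)

# Variant 1: keep only the first of the tied centers
first_only <- dists %>%
  group_by(point) %>%
  summarise(closest_center = first(center[d == min(d)]))

# Variant 2: keep every tied center, joined into one string
all_tied <- dists %>%
  group_by(point) %>%
  summarise(closest_center = paste0(center[d == min(d)], collapse = "_"))
```

With this input, `first_only` reports a for p1, while `all_tied` reports a_b, so the second variant makes ties visible instead of silently picking one.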

With the dplyr package, you can use group_by to loop over each point and mutate to form a list of distances, set distance as the min of the list, and set center as the name of the minimum distance center. I've included two alternatives for the cases of duplicate rows or point names.

library(dplyr)

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))
points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                     x_p = c(160,600,400,245, 245),
                     y_p = c(7,23,56,12, 12))

# If duplicate rows need to be removed
result1 <- points %>%
  group_by(point) %>%
  distinct() %>%
  mutate(lst = with(centers, list(sqrt((x_p - x_ce)^2 + (y_p - y_ce)^2))),
         distance = min(unlist(lst)),
         center = centers$center[which.min(unlist(lst))]) %>%
  select(-lst)

which gives the result

# A tibble: 4 x 5
# Groups:   point [4]
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a 

and

# Alternative if point names are not unique
points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                     x_p = c(160,600,400,245, 550),
                     y_p = c(7,23,56,12, 25))
result2 <- points %>%
  rowwise() %>%
  mutate(lst = with(centers, list(sqrt((x_p - x_ce)^2 + (y_p - y_ce)^2))),
         distance = min(unlist(lst)),
         center = centers$center[which.min(unlist(lst))]) %>%
  ungroup() %>%
  select(-lst)

with the result

# A tibble: 5 x 5
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a     
5 p4      550    25     50.2 d    
WaltS
  • Any idea why, with a larger dataset, I get this error: `Error in summarise_impl(.data, dots) : Column `x_p` must be length 1 (a summary value), not 4`? – s__ May 23 '18 at 13:25
  • It looks like you have rows with duplicate point names in points. If you're sure that these rows are complete duplicates, you can remove them with distinct as I've shown above. You should examine points to verify that this is the problem. – WaltS May 23 '18 at 15:49
  • Great, this works fine for the larger dataset. Thanks again. – s__ May 24 '18 at 07:24
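
One possible way to examine points for the duplicates discussed in these comments is sketched below. It reuses the five-row points example from this answer; the names `dup_names`, `full_dups`, and `points_unique` are illustrative only.

```r
library(dplyr)

points <- data.frame(point = c('p1', 'p2', 'p3', 'p4', 'p4'),
                     x_p = c(160, 600, 400, 245, 245),
                     y_p = c(7, 23, 56, 12, 12))

# Point names that occur more than once, with their counts
dup_names <- points %>%
  count(point) %>%
  filter(n > 1)

# Rows that exactly repeat an earlier row (base R)
full_dups <- points[duplicated(points), ]

# Drop only the complete duplicates
points_unique <- distinct(points)
```

If `dup_names` lists a name whose rows do not all appear in `full_dups`, the rows share a name but differ in coordinates, and the `rowwise()` alternative is probably the safer choice.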