Given a list of event locations:

event.coords <- data.frame(
    event.id = letters[1:5],
    lats = c(43.155, 37.804, 26.71, 35.466, 40.783),
    lons = c(-77.616,-122.271, -80.064, -97.513, -73.966))

and the centroids of locales (which happen to be ZIP codes, but could be states, countries, etc.):

locale.centroids <-
  data.frame(
    locale.id = 1:5,
    lats = c(33.449, 41.482, 40.778, 43.59, 41.736),
    lons = c(-112.074, -81.67, -111.888, -70.335, -111.834))

I would like to calculate how far each locale centroid is from the nearest event. My data contains 100,000 locales, so I need something computationally efficient.

Zoe
conflictcoder

2 Answers

A tidyverse approach, using the geosphere package for the geodesic distances. For each event, it finds the nearest locale centroid and the distance to it in metres:

event.coords <- data.frame(
  event.id = letters[1:5],
  lats = c(43.155, 37.804, 26.71, 35.466, 40.783),
  lons = c(-77.616,-122.271, -80.064, -97.513, -73.966))

locale.centroids <-
  data.frame(
    locale.id = 1:5,
    lats = c(33.449, 41.482, 40.778, 43.59, 41.736),
    lons = c(-112.074, -81.67, -111.888, -70.335, -111.834))

library(tidyverse)
library(geosphere)

# For each event, find the nearest locale centroid and the distance to it.
event.coords %>%
  rowwise() %>%
  mutate(new = map(list(c(lons, lats)), ~ locale.centroids %>%
                     rowwise() %>%
                     # distGeo expects points as (longitude, latitude)
                     mutate(dist = distGeo(c(lons, lats), .x)) %>%
                     ungroup() %>%
                     filter(dist == min(dist)) %>%
                     select(locale.id, dist))) %>%
  ungroup() %>%
  unnest_wider(new)

#> # A tibble: 5 x 5
#>   event.id  lats   lons locale.id     dist
#>   <chr>    <dbl>  <dbl>     <int>    <dbl>
#> 1 a         43.2  -77.6         2  382327.
#> 2 b         37.8 -122.          3  953915.
#> 3 c         26.7  -80.1         2 1645206.
#> 4 d         35.5  -97.5         1 1355234.
#> 5 e         40.8  -74.0         4  432562.

Created on 2021-07-07 by the reprex package (v2.0.0)
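
The rowwise pipeline above makes one distance call per event/locale pair. As a rough sketch of a vectorised alternative (not part of the original answer), geosphere::distm can build the whole distance matrix in one call, here in the direction the question asks for: each locale's nearest event. The names event.xy, locale.xy, d and result are illustrative, and the sketch assumes the full locales-by-events matrix fits in memory.

```r
library(geosphere)

event.coords <- data.frame(
  event.id = letters[1:5],
  lats = c(43.155, 37.804, 26.71, 35.466, 40.783),
  lons = c(-77.616, -122.271, -80.064, -97.513, -73.966))

locale.centroids <- data.frame(
  locale.id = 1:5,
  lats = c(33.449, 41.482, 40.778, 43.59, 41.736),
  lons = c(-112.074, -81.67, -111.888, -70.335, -111.834))

# geosphere expects coordinates in (longitude, latitude) order
event.xy  <- as.matrix(event.coords[, c("lons", "lats")])
locale.xy <- as.matrix(locale.centroids[, c("lons", "lats")])

# Rows are locales, columns are events; distances are in metres
d <- distm(locale.xy, event.xy, fun = distGeo)

# Column index of the smallest distance in each row
nearest <- max.col(-d, ties.method = "first")

result <- data.frame(
  locale.id     = locale.centroids$locale.id,
  nearest.event = event.coords$event.id[nearest],
  dist.m        = d[cbind(seq_len(nrow(d)), nearest)])

result
```

With 100,000 locales and many events the matrix can become large, so it may be necessary to process the locales in chunks and bind the results.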

AnilGoyal
    Elegant, and easy to understand for anyone familiar with the tidyverse. Unfortunately, it does not scale well to a large dataset. – conflictcoder Jul 07 '21 at 18:02
After converting the two data frames to sf objects, you can use sf::st_nearest_feature to match each event to its closest locale point and sf::st_distance to get the distance between them.

I have had pretty good luck with the speed of the sf package on the datasets I work with, but they are admittedly smaller than 100k. Depending on how your data are stored, you could also look into a database approach such as PostGIS.

library(sf)

# Coordinates are given as (x = longitude, y = latitude) in WGS 84
event.coords.sf <- st_as_sf(event.coords, coords = c('lons', 'lats'), crs = 'EPSG:4326')
locale.centroids.sf <- st_as_sf(locale.centroids, coords = c('lons', 'lats'), crs = 'EPSG:4326')


nearest_id <- st_nearest_feature(event.coords.sf, locale.centroids.sf)

nearest_dist <- st_distance(event.coords.sf, 
                            locale.centroids.sf[nearest_id,], 
                            by_element = TRUE)

cbind(event.coords, nearest_id, nearest_dist)
#---
  event.id   lats     lons nearest_id  nearest_dist
1        a 43.155  -77.616          2  382326.7 [m]
2        b 37.804 -122.271          3  953914.6 [m]
3        c 26.710  -80.064          2 1645205.7 [m]
4        d 35.466  -97.513          1 1355233.5 [m]
5        e 40.783  -73.966          4  432561.5 [m]
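
The output above is per event. For the direction the question asks for (each locale's nearest event), swapping the two arguments should be enough. A minimal sketch under that assumption, with nearest_event and result as illustrative names:

```r
library(sf)

event.coords <- data.frame(
  event.id = letters[1:5],
  lats = c(43.155, 37.804, 26.71, 35.466, 40.783),
  lons = c(-77.616, -122.271, -80.064, -97.513, -73.966))

locale.centroids <- data.frame(
  locale.id = 1:5,
  lats = c(33.449, 41.482, 40.778, 43.59, 41.736),
  lons = c(-112.074, -81.67, -111.888, -70.335, -111.834))

event.coords.sf <- st_as_sf(event.coords, coords = c("lons", "lats"), crs = 4326)
locale.centroids.sf <- st_as_sf(locale.centroids, coords = c("lons", "lats"), crs = 4326)

# For each locale (x), the row index of the nearest event (y)
nearest_event <- st_nearest_feature(locale.centroids.sf, event.coords.sf)

# Distance from each locale to its matched event only (pairwise, not a full matrix)
nearest_dist <- st_distance(locale.centroids.sf,
                            event.coords.sf[nearest_event, ],
                            by_element = TRUE)

result <- cbind(locale.centroids,
                nearest.event = event.coords$event.id[nearest_event],
                nearest_dist)

result
```

Because by_element = TRUE computes only one distance per locale, the result grows with the number of locales and not with locales times events.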
nniloc