Deal with coordinates and huge datasets in R

Question

I am struggling a bit with two datasets containing coordinates of individuals and cell towers:

A first dataset on 9,459 individuals with 1,214 variables, including their latitude and longitude in degrees.
A second dataset on 31,176 cell towers with 4 variables, including their latitude and longitude in degrees and their range in meters.

I would like to determine whether an individual is within the range of at least one cell tower, and create a dummy equal to 1 if that is the case.

However, due to the size of the datasets, I cannot merge them with a cross join. I tried using the geosphere package with the following command:

distm(c(df1$longitude, df2$latitude), c(df2$longitude, df2$latitude), fun= distHaversine)

Unfortunately, it does not work since the two datasets are not equally sized. Any idea of how to solve this issue?

Maybe you can make both datasets equal by appending `0` values to the smaller one until both are equal. — LocoGris, Mar 02 '19 at 16:19
I tried what you suggested however it does not work even with the `gc()` function and a loop. I always get the same error message about the memory size. — William L., Mar 04 '19 at 10:12

score 0 · Answer 1 · answered Aug 20 '19 at 21:20

This can generally be done more efficiently, with better use of RAM and processor and less overhead. However, if this is a one-time operation, the approach below should be enough (it takes around 5 minutes on a current notebook).

Helper function

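# More info: https://github.com/RomanAbashin/distGeo_v
# Vectorised geodesic distance in metres on the WGS84 ellipsoid between
# (x, y) and (xx, yy), where x/xx are longitudes and y/yy are latitudes.
# Calls geosphere's internal C routine directly, so it may break if the
# package changes its internals.
distGeo_v <- function(x, y, xx, yy) { 
    if (!requireNamespace("geosphere", quietly = TRUE)) {
        stop("The 'geosphere' package needs to be installed for this function to work.")
    }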
    matrix(.Call("_inversegeodesic", 
                 as.double(x), as.double(y), as.double(xx), as.double(yy), 
                 as.double(6378137), 1/298.257223563, PACKAGE='geosphere'), 
           ncol = 3, byrow = TRUE)[,1]
}

Data

library(geosphere)
library(tidyverse)
set.seed(1702)

users <- tibble(userid = 1:10000,
                x = rnorm(10000, 16.3738, 5),
                y = rnorm(10000, 48.2082, 5))

towers <- tibble(lon = rnorm(35000, 16.3738, 10),
                 lat = rnorm(35000, 48.2082, 10),
                 range = runif(35000, 50, 10000))

Code

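# For each user, pair that user with every tower and check whether at least
# one tower is within its own range. Only one user x all towers is held in
# memory at a time, which avoids the full cross join.
is_match <- logical(nrow(users))   # preallocate instead of growing a tibble
for(i in seq_len(nrow(users))) {

    is_match[i] <- users[i, 1:3] %>%
        tidyr::crossing(towers[, 1:3]) %>%
        filter(distGeo_v(x, y, lon, lat) <= range) %>%
        nrow() > 0

}

result <- tibble(userid = users$userid, match = is_match)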

Result

> head(result)
# A tibble: 6 x 2
  userid match
   <int> <lgl>
1      1 TRUE 
2      2 FALSE
3      3 FALSE
4      4 TRUE 
5      5 FALSE
6      6 FALSE

Now you can left_join the result to your original data.
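
As a minimal sketch of that last step, assuming your individual-level data frame shares a `userid` key with `result` (the name `individuals` and the toy values below are hypothetical stand-ins, not part of the original data), the join and the 0/1 dummy could look like this:

```r
library(dplyr)

# Hypothetical stand-ins: `individuals` plays the role of your original data,
# `result` mimics the output of the loop above.
individuals <- data.frame(userid = 1:4, age = c(23, 35, 41, 58))
result <- data.frame(userid = 1:4, match = c(TRUE, FALSE, FALSE, TRUE))

# Attach the match flag by userid and turn it into a 0/1 dummy.
individuals_flagged <- individuals %>%
    left_join(result, by = "userid") %>%
    mutate(in_range = as.integer(match))
```

Because this is a left join, every individual is kept, and anyone missing from `result` would get `NA` rather than 0.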

score 0 · Answer 2 · answered Nov 01 '19 at 20:30

Below is a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp) and are therefore very fast.

The function spatialrisk::points_in_circle() returns the observations that lie within a given radius of a center point. Distances are calculated using the Haversine formula. Each call returns a data frame, so purrr::map2_dfr() is used to row-bind the results. Note that this example uses a fixed radius of 200 meters for every tower rather than each tower's own range:

library(tibble)
library(spatialrisk)
library(dplyr)

set.seed(1702)
users <- tibble(userid = as.character(1:10000),
                lon = rnorm(10000, 16.3738, 1),
                lat = rnorm(10000, 48.2082, 1))

towers <- tibble(lon = rnorm(35000, 16.3738, 1),
                 lat = rnorm(35000, 48.2082, 1))

# Users with tower within 200 meters
purrr::map2_dfr(users$lon, users$lat, 
                   ~points_in_circle(towers, .x, .y, radius = 200)[1,], 
                   .id = "userid") %>%
     mutate(inrange = !is.na(distance_m))
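
The snippet above checks a fixed radius of 200 meters, whereas the question gives each tower its own range. The following is a rough base-R sketch of how a per-tower range could be handled, not part of the spatialrisk API. The helper names `haversine_m` and `in_any_range`, the toy coordinates, and the Earth radius of 6378137 m (the default that geosphere's distHaversine uses) are assumptions made for illustration:

```r
# Great-circle distance in metres between one point and vectors of points,
# using the Haversine formula (spherical Earth, radius r assumed).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
    to_rad <- pi / 180
    dlat <- (lat2 - lat1) * to_rad
    dlon <- (lon2 - lon1) * to_rad
    a <- sin(dlat / 2)^2 +
        cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
    2 * r * asin(pmin(1, sqrt(a)))
}

# TRUE if the point (lon, lat) is within the range of at least one tower.
in_any_range <- function(lon, lat, towers) {
    any(haversine_m(lon, lat, towers$lon, towers$lat) <= towers$range)
}

# Toy data (hypothetical values; range is in metres).
towers <- data.frame(lon = c(16.37, 16.50),
                     lat = c(48.21, 48.30),
                     range = c(500, 2000))
users <- data.frame(userid = 1:3,
                    lon = c(16.371, 16.45, 16.51),
                    lat = c(48.211, 48.25, 48.30))

# One user at a time against all towers: memory use stays proportional to
# the number of towers, not users x towers.
users$inrange <- as.integer(mapply(in_any_range, users$lon, users$lat,
                                   MoreArgs = list(towers = towers)))
```

Each iteration only allocates vectors as long as the tower table, so the full users-by-towers distance matrix is never built.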
