
I have a large data frame (> 8 million rows) with observations of individuals at different sites. I'm interested in the proximity of these sites to a few key locations (1 location in 2014 and 2 locations in 2015).

To minimize the number of calculations (and speed things up), I've used dplyr to collapse all known locations to a single representative row per site in each year, and then tried to use the distGeo function to calculate the distance when the year matches.

dist <- df %>% 
  mutate(year = year(ts)) %>% #ts is the time stamp for each observation
  select(site, lat, lon, year) %>% 
  group_by(site, lat, lon, year) %>% 
  summarise(n=n()) %>% #if I stop after summarise, the data frame has been reduced to 93 observations
  mutate(dist1 = ifelse(year == "2014",
                        distGeo(c(-64.343043, 45.897932), #coordinates for key location in 2014
                                df[,c("lon", "lat")])/1000, 
                         NA_real_)) #I have similar lines for the two key locations in 2015

Just running this part takes ~30 minutes, and the result is a distance of 740.1656 km for every 2014 site. How can I fix this code so that it gives the correct distances and, ideally, runs faster?

EDIT:

As suggested below, here's the solution:

dist <- df %>% 
  mutate(year = year(ts)) %>%
  select(site, lat, lon, year) %>% 
  group_by(site, lat, lon, year) %>% 
  summarise(n=n()) %>% 
  mutate(dist1 = ifelse(year == "2014",
                     pmap_dbl(list(lon, lat),
                              ~distVincentyEllipsoid(c(-64.343043, 45.897932), 
                                                     c(.x, .y))/1000), 
                     NA_real_))
tnt

1 Answer


You can use purrr::pmap to do this quite quickly (distGeo is not vectorised over separate lon and lat arguments; it expects each point as a (lon, lat) pair)...

library(tidyverse) #for dplyr and purrr
library(geosphere) #for distGeo

df <- data.frame(lat = 90*runif(100), lon = 90*runif(100)) #dummy data

dist <- df %>% 
  mutate(dist1 = pmap_dbl(list(lon, lat),     #pmap_dbl ensures output is vector of numbers
                          ~distGeo(c(-64.343043, 45.897932), 
                                   c(.x, .y)) / 1000))

You'll need to modify this to include the year and other variables that I have ignored.

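The problem with your code was the use of the df[...] term inside a dplyr pipeline that began with df. Inside mutate, df[, c("lon", "lat")] still refers to the original, unsummarised data frame, so distGeo was being asked for distances to every row of the full data rather than to the summarised sites. Best to just work with bare variable names as above.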
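As a rough sketch of how the year condition could be added back in, something like the following might work. The `sites` data frame and the `key_2014` name are made up for illustration (dummy data standing in for your summarised table); the column names follow the question.

```r
library(dplyr)
library(purrr)
library(geosphere)

set.seed(1)
# Hypothetical stand-in for the summarised site table from the question
sites <- data.frame(
  site = paste0("s", 1:6),
  lat  = runif(6, 40, 50),
  lon  = runif(6, -70, -60),
  year = rep(c(2014, 2015), each = 3)
)

key_2014 <- c(-64.343043, 45.897932)  # (lon, lat) of the 2014 key location

dist <- sites %>%
  mutate(dist1 = if_else(year == 2014,
                         # iterate over the lon/lat columns row by row
                         pmap_dbl(list(lon, lat),
                                  ~ distGeo(key_2014, c(.x, .y)) / 1000),
                         NA_real_))  # rows from other years get NA
```

Note that the distance is still computed for every row and only masked afterwards, which should be fine for a table of ~93 rows.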

Andrew Gustar
  • Thanks @Andrew Gustar. Can you explain in a bit more detail what you mean by distGeo is not vectorized? I've done something similar in the past and haven't had the same issue. – tnt Jan 25 '19 at 18:10
    @tnt The first two arguments of `distGeo` are 2-vectors (lon, lat) (or, I think, n*2 matrices), so you can't simply replace them with vectors and expect the function to produce a vector output, as you can with many R functions. Instead you need something like `pmap` or `mapply` to iterate through the two lat and lon vectors simultaneously. – Andrew Gustar Jan 25 '19 at 23:26
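To illustrate the two options mentioned in the comment above, here is a minimal sketch with made-up coordinates. It assumes, as the geosphere documentation describes, that `distGeo` accepts a two-column (lon, lat) matrix for its second argument; the `key`, `lon` and `lat` values are for illustration only.

```r
library(geosphere)

key <- c(-64.343043, 45.897932)   # (lon, lat) of the key location
lon <- c(-63.5, -66.1, -60.2)     # made-up site longitudes
lat <- c(44.6, 45.3, 46.1)        # made-up site latitudes

# Option 1: iterate over the lon and lat vectors in parallel
d_mapply <- mapply(function(x, y) distGeo(key, c(x, y)), lon, lat) / 1000

# Option 2: bind the columns into an n x 2 matrix and call distGeo once
d_matrix <- distGeo(key, cbind(lon, lat)) / 1000

# Compare the two results
all.equal(d_mapply, d_matrix)
```

If the matrix form behaves as assumed, it avoids the per-row function calls and could be dropped into mutate as `distGeo(key, cbind(lon, lat)) / 1000`.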