I have a large data frame (> 8 million rows) with observations of individuals at different sites. I'm interested in the proximity of these sites to a few key locations (1 location in 2014 and 2 locations in 2015).
To minimize the number of calculations (and speed things up), I've used dplyr to collapse all known locations to a single representative site in each year, and then tried to use the distGeo function to calculate the distance when the year matches.
library(dplyr)
library(lubridate)
library(geosphere)

dist <- df %>%
  mutate(year = year(ts)) %>%        # ts is the time stamp for each observation
  select(site, lat, lon, year) %>%
  group_by(site, lat, lon, year) %>%
  summarise(n = n()) %>%             # if I stop after summarise, the data frame has been reduced to 93 observations
  mutate(dist1 = ifelse(year == "2014",
                        distGeo(c(-64.343043, 45.897932),       # lon/lat of the key location in 2014
                                df[, c("lon", "lat")]) / 1000,
                        NA_real_))   # I have similar lines for the two key locations in 2015
Just running this part takes ~30 minutes, and the result is a distance of 740.1656 km for every 2014 site. How can I fix this code so it returns the correct distances and, ideally, speed up the calculation?
EDIT:
As suggested below, here's the solution. The problem was that df[, c("lon", "lat")] inside mutate referred back to the full 8-million-row data frame rather than the summarised rows, so every group ended up with the distance to the same first coordinate pair (and all ~8 million distances were recomputed for every group, which also explains the run time). Mapping over each row's own lon/lat fixes both:
library(purrr)   # for pmap_dbl()

dist <- df %>%
  mutate(year = year(ts)) %>%
  select(site, lat, lon, year) %>%
  group_by(site, lat, lon, year) %>%
  summarise(n = n()) %>%
  mutate(dist1 = ifelse(year == "2014",
                        pmap_dbl(list(lon, lat),
                                 ~ distVincentyEllipsoid(c(-64.343043, 45.897932),
                                                         c(.x, .y)) / 1000),
                        NA_real_))
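For the speed side of the question: distGeo() and distVincentyEllipsoid() both accept a two-column (lon, lat) matrix, so the row-by-row pmap_dbl() call can be replaced by a single vectorised call over the 93 summarised rows. A minimal sketch, using sites as a stand-in name for the summarised data frame above and relying on geosphere recycling a single point against a matrix:

library(dplyr)
library(geosphere)

key2014 <- c(-64.343043, 45.897932)   # lon/lat of the 2014 key location

sites <- sites %>%
  ungroup() %>%   # summarise() leaves the data grouped; one ungrouped call is fastest
  mutate(dist1 = ifelse(year == "2014",
                        distGeo(cbind(lon, lat), key2014) / 1000,  # vectorised over all rows at once
                        NA_real_))

distGeo() also computes geodesic distances on the WGS84 ellipsoid (via Karney's algorithm), so it should agree with distVincentyEllipsoid() to well under a millimetre while being considerably faster.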