determine closest site and select another variable from that location

Question

I have study sites where I've collected data and nearby weather stations with information on temperature and precipitation. I'd like to pair my daily data at the study sites with weather information from the closest weather station. I think, to do this, I need a two step process where I first pick the closest weather station to the study site, and then I create a new variable with the weather data.

Here's a snapshot of my data:

# study sites
site <- rep(LETTERS[1:3], 5)
siteLat <- rep(c(41, 42, 44), 5)
siteLon <- rep(c(68, 62, 63), 5)
siteDate <- rep(1:5, 3)
dfSites <- data.frame(cbind(site, siteLat, siteLon, siteDate))

# weather stations
station <- rep(letters[1:3], 5)
stationLat <- rep(c(40, 43, 45), 5)
stationLon <- rep(c(67, 61, 64), 5)
stationDate <- rep(1:5, 3)
temp <- sample(10:20, 15, replace=TRUE)
dfStation <- data.frame(cbind(station, stationLat, stationLon, stationDate, temp))

I'm trying to use this line to determine which station is closest, but I only get a single row of distances.

distVincentyEllipsoid(df2[c("recvLon", "recvLat")], weather[c("lon", "lat")])

I'm a little unsure about the next steps once all the distances are calculated, but I think I'd need something to select the closest station and the match up dates. This is the best I've come up with:

dfSites %>% 
    mutate(closestStation = ???,
           temp1 = temp[station == closestStation & stationDate == siteDate])

The final result is my study site dataframe with an additional column for temperature from the closest weather station.

BTW: you should never need to use `data.frame(cbind(...))`, just do `data.frame(...)`; at best, it is an inefficient no-op, at worst it might munge your data in difficult-to-troubleshoot ways. — r2evans, Jul 16 '19 at 17:35
I think `distVincentyEllipsoid(p1, p2, ...)` tries to find the distance between the first point of `p1` with the first point of `p2`, second of `p1` with second of `p2`, etc. What you need is an expansion along the lines of *"first in `p1` against all of `p2`, second in `p1` with all of `p2`, etc). — r2evans, Jul 16 '19 at 17:41

score 0 · Answer 1 · edited Jul 17 '19 at 16:47

I think distVincentyEllipsoid(p1, p2, ...) tries to find the distance between the first point of p1 with the first point of p2, second of p1 with second of p2, etc. What you need is an expansion along the lines of *"first in p1 against all of p2, second in p1 with all of p2, etc).

Adjusting your code to call dfSites and dfStation (instead of df2/weather), the following should work for you. (I'm going to remove one of the stations with dfStation[-1,...] just for clearly recognizing which of the dimensions represents sites versus stations.

alldists <- sapply(seq_len(nrow(dfSites)), function(i) {
  distVincentyEllipsoid(dfSites[i,c("siteLon","siteLat")],
                        dfStation[-1,c("stationLon","stationLat")])
})
alldists
#           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
#  [1,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
#  [2,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
#  [3,] 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7
#  [4,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
#  [5,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
#  [6,] 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7
#  [7,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
#  [8,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
#  [9,] 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7
# [10,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
# [11,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
# [12,] 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7
# [13,] 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1
# [14,] 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4
#           [,9]    [,10]    [,11]    [,12]    [,13]    [,14]    [,15]
#  [1,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
#  [2,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2
#  [3,] 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5
#  [4,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
#  [5,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2
#  [6,] 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5
#  [7,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
#  [8,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2
#  [9,] 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5
# [10,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
# [11,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2
# [12,] 484015.5 119427.7 565573.7 484015.5 119427.7 565573.7 484015.5
# [13,] 228960.0 786180.9 123505.1 228960.0 786180.9 123505.1 228960.0
# [14,] 122086.2 481351.6 269760.4 122086.2 481351.6 269760.4 122086.2

(Because we have 14 rows, each row is one of your stations. You should not do the [-1,] indexing, just know which is the row/column.) From this, we know that the different between site A and station b is 481351.6 meters (first column, second row).

From here, just find the column-minimum:

apply(alldists, 2, which.min)
#  [1] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2

suggesting that the closest station to site A is b (which.min will return the first minimum, it does not indicate ties).

Now, dfStation[apply(alldists, 2, which.min),] gives you 15 rows of station data that can easily be cbinded or otherwise combined with dfSites.

dplyr options:

dfSites %>%
  mutate(
    station_i = purrr::map2_int(
      siteLat, siteLon,
      ~ which.min(geosphere::distVincentyEllipsoid(
          cbind(.x,.y), dfStation[-1,c("stationLon","stationLat")]))
      ),
    station = as.character(dfStation$station)[ station_i ]
  )
#    site siteLat siteLon siteDate station_i station
# 1     A      41      68        1         3       c
# 2     B      42      62        2         1       a
# 3     C      44      63        3         2       b
# 4     A      41      68        4         3       c
# 5     B      42      62        5         1       a
# 6     C      44      63        1         2       b
# 7     A      41      68        2         3       c
# 8     B      42      62        3         1       a
# 9     C      44      63        4         2       b
# 10    A      41      68        5         3       c
# 11    B      42      62        1         1       a
# 12    C      44      63        2         2       b
# 13    A      41      68        3         3       c
# 14    B      42      62        4         1       a
# 15    C      44      63        5         2       b

A slight (10-15%) speed improvement can be seen by doing an outer-product of them.

outer(seq_len(nrow(dfSites)), seq_len(nrow(dfStation)),
      function(i,j) geosphere::distVincentyEllipsoid(dfSites[i,2:3], dfStation[j,2:3]))

That also returns an mxn matrix (station rows) that you then apply(...) across to get the closest index. (I was hoping for a larger performance gain, since distVincentyEllipsoid is called only once ...)

Thanks @r2evans. I'm having a hard time following this code (perhaps because I mostly use dplyr, so I'm unfamiliar with this syntax); can you clarify the following? 1) why is the sappy necessary? 2) why is there a -1 after weather[? 3) the apply(alldists... only returns a column number not the name of the station. 4) how does this help me match up the weather on a specific date with the data at my study site for that date? — tnt, Jul 16 '19 at 18:22
(1) `sapply` is needed because we want to compare `p1[1,]` against all of `p2`, then `p1[2,]` against all of `p2`, etc. It happens to return a `matrix` which we can aggregate column-wise. (2) Read the answer, I specifically mentioned `[-1,]`, you don't need the `-1`. (3) The column number is then used (as I demo) to index on the stations frame; you can always do `dfStation$station[apply(alldists, 2, which.min)]` to get the station name. (4) You can then `cbind` the station data with the sites, as I showed in the last paragraph. — r2evans, Jul 16 '19 at 18:38
Thanks @r2evans. I'm still unclear about the purpose of the -1; when I remove it, I get an error message: object 'i' not found. I also don't think a simple cbind will work to join the two df because the dates also have to match up. I've tried the dplyr option, but that also doesn't seem to be working correctly (although for me it is easier to follow). In this case, the station that is being identified as closest, isn't always the closest one based on my own calculations; the difference is often quite large (e.g., 20 km) between the site that is selected and the actual closest one. — tnt, Jul 17 '19 at 14:58

determine closest site and select another variable from that location

1 Answers1