
I have two datasets: one with 488,286 rows of longitude and latitude coordinates, and a second with 245,077 rows of longitude and latitude coordinates. The second also has additional data relating to the coordinates. I want to find the closest point in the second dataset to each point in the first. I cannot share the raw data, so for the sake of simplicity I will generate some random points here:

df1 <- cbind(runif(488286, min = -180, max = -120), runif(488286, min = 50, max = 85))
df2 <- cbind(runif(245077, min = -180, max = -120), runif(245077, min = 50, max = 85))

I tried just using the distm function, but the data was too large, so I then tried to break it down like this:

library(geosphere)

# for each point in df1, compute its distance to every point in df2
# and keep the index of the nearest one
closest <- apply(df1, 1, function(x){
    mat <- distm(x, df2, fun = distVincentyEllipsoid)
    return(which.min(mat))
})

I think this works, but it takes so long to run that I haven't actually seen the results (I have only tried it with a subset of the data). I really need a quicker way of doing this, as I left it running for two days and it did not finish. It doesn't have to use distm; anything quicker that is still accurate would be fine.

Thanks in advance!

user5481267
  • Yes @Parfait that is quick to do. I will be running this on a high-memory queue on a server, so in theory I should have a lot of RAM available – user5481267 Apr 01 '19 at 14:35
  • First, please include all `library` lines for non-base R functions. Curious, does the [distHaversine](https://github.com/cran/geosphere/blob/master/R/distHaversine.R) run faster than [distVincentyEllipsoid](https://github.com/cran/geosphere/blob/master/R/distVincentyEllipsoid.R)? As you can see, the latter runs with nested `for` and `while` loops. – Parfait Apr 01 '19 at 15:05
  • @Parfait ah sorry, I've added the relevant library now. I think it does; I didn't realise it would make much of a difference. But for just one of my locations in df1 the Haversine one takes 0.180 seconds and the distVincentyEllipsoid one takes 53.989 seconds, so that's quite a big difference. So perhaps the answer is as simple as changing the argument (sketched below) – user5481267 Apr 01 '19 at 15:17
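
Following up on that thread, a minimal sketch of the question's loop with the cheaper haversine formula swapped in (this assumes metre-level ellipsoidal accuracy is not required; haversine treats the Earth as a sphere and is typically within about 0.5% of the ellipsoidal distance):

library(geosphere)

# same brute-force search, but with the much cheaper great-circle formula
closest <- apply(df1, 1, function(x){
    which.min(distm(x, df2, fun = distHaversine))
})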

1 Answer


Maybe this works for you:

library(sf)
library(RANN)

df1 <- data.frame("lon" = runif(2000, min = -180, max = -120), "lat" = runif(2000, min = 50, max = 85))
df2 <- data.frame("lon" = runif(1430, min = -180, max = -120), "lat" = runif(1430, min = 50, max = 85))

df1_sf <- st_as_sf(df1, coords = c("lon", "lat"),
                   crs = 4326, agr = "constant")

df2_sf <- st_as_sf(df2, coords = c("lon", "lat"),
                   crs = 4326, agr = "constant")

# nn2() needs a plain numeric matrix, so extract the coordinates from the
# sf objects; note it measures Euclidean distance in degrees, so matches
# are approximate for lon/lat data, increasingly so at high latitudes
nearest <- nn2(st_coordinates(df2_sf), st_coordinates(df1_sf),
               k = 1, treetype = 'bd', searchtype = 'priority')

df2_sf[nearest$nn.idx, ]



RANN wraps the ANN C++ nearest-neighbour library, so it should be pretty quick. I nevertheless reduced the number of points for this answer.

First I converted df1 and df2 to sf objects. I then fed their coordinate matrices to nn2(), a k-nearest-neighbour search, which returns a list. The nn.idx element of that list contains, for each point in df1, the index of the closest point in df2.
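
As a small usage sketch (not part of the original answer), one might join each df1 point to its nearest df2 row and check the real geodesic distance with geosphere; the dist_m column name is just an illustration:

library(geosphere)

idx <- nearest$nn.idx[, 1]          # nearest df2 index for each df1 point
matched <- cbind(df1, df2[idx, ])   # attach the nearest df2 row to each df1 row

# sanity check: great-circle distance in metres for each matched pair
matched$dist_m <- distHaversine(df1, df2[idx, ])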

UPDATE: You can also parallelize

library(parallel)

# spin up a cluster with 4 workers
cl <- parallel::makeCluster(4)

# split df1 into 4 contiguous chunks, one per worker
df1_split <- split(df1_sf, cut(1:nrow(df1_sf), 4, labels = FALSE))

# make df2 and the required packages available on every worker
clusterExport(cl, "df2_sf")
clusterEvalQ(cl, { library(RANN); library(sf) })

system.time(
  idxlist_parallel <- clusterApply(cl, df1_split,
                                   function(x) nn2(st_coordinates(df2_sf), st_coordinates(x),
                                                   k = 1, treetype = 'bd', searchtype = 'priority'))
)
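
The chunk results then need stitching back together; a sketch, relying on clusterApply() returning results in input order so the concatenated indices line up with the rows of df1_sf:

# one nearest-neighbour index per df1 point, in the original row order
nearest_idx <- unlist(lapply(idxlist_parallel, function(r) r$nn.idx[, 1]))

stopCluster(cl)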
Humpelstielzchen
  • Thank you @Humpelstielzchen. This seems to work, but it also takes a long time. I've just tried it with one location in the first dataset and it's taking over 20 minutes so far (still running). I just wondered if there was anything that would be a lot quicker, as I have to do this for over 400,000 points? – user5481267 Apr 01 '19 at 14:08
  • Sure, this takes time. Maybe you can speed things up by parallelizing. I have made an update in regard to that. – Humpelstielzchen Apr 01 '19 at 14:44
  • Huh, not sure why, but the original job I had was killed, and even when I try to run the code in R it kills the session. The other method with distm does not have this problem and runs more quickly for a small subset of data – user5481267 Apr 01 '19 at 15:05
  • `distm()` is quicker than `nn2()`? Not for me. `nn2()` running on two cores is already 30 times faster and your function is still running. Parallelizing is only useful for big datasets; small datasets take longer when parallelized because of the overhead attached to the method. – Humpelstielzchen Apr 01 '19 at 15:16
  • Ah I see, I wasn't parallelizing, so yes that would probably be quicker. I need to figure out how to do it on the system and to understand the above, as I'm not used to doing it that way! Thank you – user5481267 Apr 01 '19 at 16:02