I have two lists of postcodes for Germany (of different lengths). For each postcode in the first data frame I need to find its closest neighbour in the second list. I have longitude and latitude information for every postcode. It would be great to see which plz is closest, but my main interest is a distance measure (at the moment I am flexible about which one). I could compute all possible combinations (about 2,000,000) and their distances via the geosphere package or the Google Maps DirectionFinder, then pick the smallest distance, but I think I need to apply some form of https://en.wikipedia.org/wiki/Nearest_neighbor_search

There are about 10,000 origins and about 200 destinations. I came across the RANN package and nn2() with the option searchtype = "priority" instead of searchtype = "radius" (which I don't need).

    plz       city             lon             lat
1 69115 Heidelberg 8.6934499740601 49.406078338623
2 44137   Dortmund       7.4582135      51.5143952
3 70178  Stuttgart         9.17115        48.77426
4 68159   Mannheim 8.4696736826668 49.491940248873
5 68167   Mannheim       8.4971965      49.5038859

    plz            city       lon        lat
1 76530     Baden-Baden 8.2423068 48.7438178
2 89081             Ulm  9.961367 48.4253282
3 69120      Heidelberg 8.6752461 49.4225417
4 72076        Tübingen 9.0406256 48.5312051
5 74523 Schwäbisch-Hall 9.7424451 49.1247435
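
What I have in mind is roughly the following (untested sketch; I rebuilt the two example tables above as plain data frames just to make it self-contained):

```r
library(RANN)

# The five example rows from above, as plain data frames
df1 <- data.frame(plz = c(69115, 44137, 70178, 68159, 68167),
                  lon = c(8.69345, 7.4582135, 9.17115, 8.4696737, 8.4971965),
                  lat = c(49.4060783, 51.5143952, 48.77426, 49.4919402, 49.5038859))
df2 <- data.frame(plz = c(76530, 89081, 69120, 72076, 74523),
                  lon = c(8.2423068, 9.961367, 8.6752461, 9.0406256, 9.7424451),
                  lat = c(48.7438178, 48.4253282, 49.4225417, 48.5312051, 49.1247435))

# For each origin in df1, find the single nearest destination in df2
nn <- nn2(data = df2[, c("lon", "lat")], query = df1[, c("lon", "lat")],
          k = 1, searchtype = "priority")
nn$nn.idx   # row in df2 of the nearest destination for each df1 row
nn$nn.dists # Euclidean distance in degrees
```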
Marco
  • I'd just use lat/lon as x-y (or y-x if you prefer) coordinates and calculate a Euclidean distance measure. At the scale of postcode areas, more complicated derivations such as the geodetic distance between lat/lon pairs are just going to add computation time for nothing (?). – High Performance Mark Sep 13 '22 at 13:18
  • I have done it for a few data points with the `Rfast` package and `dista()`. But (1) I was not sure how to interpret those distances (do they have units?), and (2) I would need to do all 2 million calculations because I don't have definite pairs. I assume that is too resource-expensive. – Marco Sep 13 '22 at 14:34
  • *I would need to calculate all 2 million calculations* – No, you'd need some sort of spatial index structure so you only do nearest-neighbour calculations for a fraction of the total. Divide Germany into 5-arc-minute × 5-arc-minute 'square' buckets (maybe larger, maybe smaller) and only compute the n-n distances for postcodes in the same bucket or in neighbouring buckets. – High Performance Mark Sep 13 '22 at 14:37
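
For scale: the full pairwise computation the comments discuss is only 10,000 × 200 = 2,000,000 distances, which base R can do in one vectorised shot, so whether it is actually too expensive can be checked directly. A sketch with simulated coordinates (the sizes and coordinate ranges are made up for illustration):

```r
# Simulated stand-ins for the real tables: 10,000 origins, 200 destinations
set.seed(1)
origins <- cbind(lon = runif(10000, 6, 15), lat = runif(10000, 47, 55))
dests   <- cbind(lon = runif(200, 6, 15),  lat = runif(200, 47, 55))

# Full 10,000 x 200 Euclidean distance matrix in degrees (~16 MB)
d <- sqrt(outer(origins[, "lon"], dests[, "lon"], "-")^2 +
          outer(origins[, "lat"], dests[, "lat"], "-")^2)

nearest_idx  <- max.col(-d, ties.method = "first")  # closest destination per origin
nearest_dist <- d[cbind(seq_len(nrow(d)), nearest_idx)]
```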

1 Answer


I would use the FNN package to find the nearest neighbour for each plz, based on Euclidean distance over the longitude and latitude values. For example:

library(data.table)
library(FNN)

df1 <- fread("plz       city             lon             lat
69115 Heidelberg 8.6934499740601 49.406078338623
44137   Dortmund       7.4582135      51.5143952
70178  Stuttgart         9.17115        48.77426
68159   Mannheim 8.4696736826668 49.491940248873
68167   Mannheim       8.4971965      49.5038859")

df2 <- fread("plz            city       lon        lat
76530     Baden-Baden 8.2423068 48.7438178
89081             Ulm  9.961367 48.4253282
69120      Heidelberg 8.6752461 49.4225417
72076        Tübingen 9.0406256 48.5312051
74523 Schwäbisch-Hall 9.7424451 49.1247435")

nearest_neighbours <- get.knnx(df1[, .(lon, lat)], df2[, .(lon, lat)], k = 1)

The nearest_neighbours object is a list with two components: $nn.index, which for each row of the query set (the second argument, here df2) gives the row index of its nearest neighbour in the data set (the first argument, here df1); and $nn.dist, which gives the corresponding Euclidean distance in degrees of longitude/latitude. Note the direction: this call finds the nearest df1 row for each df2 row; swap the two arguments to get, for each plz in df1, its nearest neighbour in df2.

$nn.index
     [,1]
[1,]    4
[2,]    3
[3,]    1
[4,]    3
[5,]    3

$nn.dist
           [,1]
[1,] 0.78190978
[2,] 0.86382655
[3,] 0.02454431
[4,] 0.27588458
[5,] 0.67023636
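
If a distance in kilometres is more useful than degrees, one option (a sketch, not part of the code above; the helper name haversine_km is made up) is a base-R haversine on each matched pair:

```r
# Great-circle distance in km between two lon/lat points (base R only)
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))
}

# First match above: Baden-Baden (df2 row 1) -> Mannheim 68159 (df1 row 4)
haversine_km(8.2423068, 48.7438178, 8.4696736826668, 49.491940248873)
```

The function is vectorised, so haversine_km(df2$lon, df2$lat, df1$lon[idx], df1$lat[idx]) with idx <- nearest_neighbours$nn.index[, 1] converts every matched pair at once.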
rw2