
I have a data frame of ids and coordinates. I need to calculate the geographic distance between all my ids, drop the pairs that are too far from each other, and then go on with my analysis.

I have 30k ids, which would generate a 30k x 30k matrix. Here is a sample:

 latitude longitude        id
-23.52472 -46.47785 917_62346
-23.62010 -46.69345 244_42975
-23.61636 -46.48148 302_75289
-23.53826 -46.46756 917_96304
-23.58266 -46.54495 302_84126
-23.47005 -46.70921 233_97098
-23.49235 -46.49342 917_62953
-23.52226 -46.72710 244_42245
-23.64853 -46.72237 635_90928
-23.49640 -46.61215  244_2662

x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624, 
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581, 
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512, 
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192, 
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975", 
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953", 
"244_42245", "635_90928", "244_2662")), .Names = c("latitude", 
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L, 
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")

First I tried to go straight for it, using the geosphere package:

library(geosphere)
library(data.table)
d.matrix <- distm(cbind(x2$longitude, x2$latitude))

This does not work because of memory issues: Error: cannot allocate vector of size 15.4 Gb. My second attempt was to first generate all the pairwise combinations beforehand, then merge with the original data set to get the latitudes and longitudes, and then calculate the distances, such as

dis.long <- expand.grid(x2$id, x2$id)
dis.long <- merge(dis.long, x2, by.x = "Var1", by.y = "id")
dis.long <- merge(dis.long, x2, by.x = "Var2", by.y = "id")
setDT(dis.long)  # merge() returns a data.frame; := needs a data.table
dis.long[ , dist_km2 := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2), 
                                matrix(c(longitude.y, latitude.y), ncol = 2))/1000]

However, expand.grid runs out of memory. This is strange to me, since the result would be 900 million rows by 2 columns, and I already deal with far larger data sets (like 200 million rows x 50 columns).
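
A rough back-of-the-envelope estimate already shows the problem (assuming plain 8-byte double columns; the copies made by merge and distGeo come on top of this):

n <- 30000
n^2 * 8 / 1024^3       # ~6.7 GiB for a single 30k x 30k numeric matrix
n^2 * 2 * 8 / 1024^3   # ~13.4 GiB for 9e8 rows x 2 numeric columns

Either object, plus the intermediate copies R makes, quickly exceeds 16 GB.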

Another observation: I already tried using integer ids such as new_id = seq(1L, 30000L, 1L) to see whether integers would solve it, but I get the same memory issue when I try to expand.

This is my current configuration, on a desktop with 16 GB of RAM:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xlsx_0.5.7        xlsxjars_0.6.1    rJava_0.9-8       geosphere_1.5-5   sp_1.2-5          haven_1.0.0      
[7] stringr_1.2.0     data.table_1.10.4

Can anybody give me an idea of how to calculate these distances? And why can't I generate this specific expand.grid while being able to construct bigger objects?

Felipe Alvarenga
  • The `expand.grid` step will create 900 *million* rows of 2 columns - each of 30K `id`s for each 30K `id`s is 3x10^4 * 3x10^4 = 9x10^8. expand.grid is just giving you the 30Kx30K matrix in long form. – Gavin Simpson Sep 01 '17 at 16:51
  • I forgot a few zeros @GavinSimpson, already fixed it. – Felipe Alvarenga Sep 01 '17 at 16:57
  • @user10089632 It looks like a duplicate, but what I am looking for, if there is one, is a way of computing these distances that does not kill my memory (maybe in parallel). – Felipe Alvarenga Sep 01 '17 at 16:57
  • OK, removing that; it was just 'Possible!'. But I still think it can be very instructive to look at the hints provided in the accepted answer there. – user10089632 Sep 01 '17 at 17:00
  • I had already bumped into that question, which was quite useful, but did not solve my issue. – Felipe Alvarenga Sep 01 '17 at 17:02
  • @FelipeAlvarenga If you can't create the entire 900 million element matrix plus a copy or two, you aren't going to be able to do *anything* with it *in memory*. However, you haven't said what you want to do with these distances. It may well be that the ultimate use can be chunked so that you never need to hold the entire distance matrix in memory. – Gavin Simpson Sep 01 '17 at 17:03
  • Essentially, I need to create clusters between ids, so I would drop all pairwise connections with distances over 5 km and then proceed with spatial econometrics. I already have a script that runs a loop, calculating the distance of a single id to all others, dropping long distances and storing the rest, but this approach is too slow. – Felipe Alvarenga Sep 01 '17 at 17:07
  • @FelipeAlvarenga You could use rounding or cut() or some similar approach to simply divide up your 30K ids into grids or boxes of at least 5 km square. Then you just need to calculate distances for each id to the ids in its own grid cell and adjacent cells (see the sketch after these comments). This is assuming that, if the distance between two ids is over 5 km, you don't care about the actual value of that distance. – Adrian Martin Sep 01 '17 at 18:28
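
A minimal sketch of that gridding idea, assuming the 5 km cutoff above and reusing the sample data x2; the 0.05-degree cell size, the neighbour expansion with data.table, and all column names are illustrative choices, not code from the comment:

library(geosphere)
library(data.table)

# 0.05 degrees is at least 5 km in both directions near latitude -23.5
# (1 deg latitude ~ 111 km, 1 deg longitude ~ 102 km there)
cell_deg <- 0.05

dt <- as.data.table(x2)
dt[, `:=`(cx = floor(longitude / cell_deg),
          cy = floor(latitude  / cell_deg))]

# pair each point with its own cell plus the 8 neighbouring cells
offsets  <- CJ(dx = -1:1, dy = -1:1)
expanded <- dt[rep(seq_len(nrow(dt)), each = nrow(offsets))]
expanded[, `:=`(cx = cx + rep(offsets$dx, times = nrow(dt)),
                cy = cy + rep(offsets$dy, times = nrow(dt)))]

# candidate pairs share a (possibly shifted) cell; points separated by more
# than one cell in any direction are guaranteed to be more than 5 km apart
pairs <- merge(expanded, dt, by = c("cx", "cy"),
               suffixes = c(".a", ".b"), allow.cartesian = TRUE)
pairs <- pairs[id.a < id.b]   # drop self-pairs and the B-A duplicate of A-B
pairs[, dist_km := distGeo(cbind(longitude.a, latitude.a),
                           cbind(longitude.b, latitude.b)) / 1000]
near <- pairs[dist_km <= 5]

The key point is that the full 30k x 30k pair set is never materialised; only pairs in neighbouring cells ever reach distGeo.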

1 Answer


You do not need to compare all vs. all, which would include self-comparisons and both directions of each pair (A-B and B-A give the same distance); therefore you should try combn instead of expand.grid.

Your data

x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624, 
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581, 
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512, 
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192, 
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975", 
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953", 
"244_42245", "635_90928", "244_2662")), .Names = c("latitude", 
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L, 
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")

expand.grid

OP <- function(df) {
            x3 = expand.grid(df$id, df$id)
            Y <- merge(x3, df, by.x = "Var1", by.y = "id")
            Y <- merge(Y, df, by.x = "Var2", by.y = "id")
            return(Y)
      }

vs combn

CP <- function(df) {
            Did = as.data.frame(t(combn(df$id, 2)))
            Z <- merge(Did, df, by.x = "V1", by.y = "id")
            Z <- merge(Z, df, by.x = "V2", by.y = "id")
            return(Z)
      }

Comparison

dis.long <- OP(x2)
object.size(dis.long)
# 7320 bytes

new <- CP(x2)
object.size(new)
# 5016 bytes

Much larger example

num <- 5e2
bigx <- data.frame(latitude=rnorm(num)*-23, longitude=rnorm(num)*-46, id=1:num)

bigdl <- OP(bigx)
object.size(bigdl)
# 10001224 bytes

bignew <- CP(bigx)
object.size(bignew)
# 4991224 bytes

About half the size
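
One possible continuation from there (a sketch, not part of the original answer: it assumes the 5 km cutoff discussed in the comments, and note that for 30k ids combn still produces about 450 million pairs, so in practice it would have to be chunked or combined with the gridding idea above):

library(geosphere)

new <- CP(x2)
# after CP(), the .x columns hold V1's coordinates and the .y columns hold V2's
new$dist_km <- distGeo(cbind(new$longitude.x, new$latitude.x),
                       cbind(new$longitude.y, new$latitude.y)) / 1000
close_pairs <- new[new$dist_km <= 5, ]   # keep only pairs within 5 km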

CPak