
I have two data frames with three variables each: location_id, latitude, and longitude. For every location_id in the first data frame, I need to find the closest location_id in the second data frame, along with the distance between the two points.

I've tried using `expand.grid` to generate every possible combination of the two sets of location_ids (that worked), but when I then tried to merge the latitudes and longitudes from the original data frames onto that combined list, I ran out of memory (there are 7,000 location_ids in the first data frame and 5,000 in the second).

I was able to get a function to calculate the distance between two points from elsewhere on Stack Overflow:

# Haversine formula: great-circle distance between two points given in
# decimal degrees; the result is in kilometres.
earth.dist <- function(long1, lat1, long2, lat2) {
  rad <- pi / 180          # degrees to radians
  a1 <- lat1 * rad
  a2 <- long1 * rad
  b1 <- lat2 * rad
  b2 <- long2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat / 2))^2 + cos(a1) * cos(b1) * (sin(dlon / 2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  R <- 6378.145            # Earth's radius in km
  d <- R * c
  return(d)
}
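
For a single pair of points it works as expected (the result is in kilometres, since R above is the Earth's radius in km):

earth.dist(-99.78064, 32.40913, -100.39325, 32.45192)  # distance in km between the first two locations in the EDIT below
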

but I'm having a hard time applying it in the context of this problem. Any help is appreciated!

EDIT:

The sets of data look exactly like this:

location_id LATITUDE  LONGITUDE
     211099 32.40913  -99.78064
     333547 32.45192 -100.39325
     369561 32.47458  -99.69176
     123141 33.68169  -96.60887
     386913 33.99921  -96.40743
     123331 31.96173  -83.75830
A Trask
  • can you give a reproducible example please? I think there may be a solution based on `outer()` to measure the lat-distances and long-distances among pairs of points first, then combine them (this certainly works for Euclidean distances -- `sqrt(outer(x1,x2,"-")^2+outer(y1,y2,"-")^2)` but it might take some more work to make it work with great-circle distances) – Ben Bolker Jul 30 '15 at 19:25
  • I'm guessing there's a tool to build sets of points (like your dfs) into polygons and then to take distances between polygons. This might be relevant: http://stackoverflow.com/q/8579913/1191259 – Frank Jul 30 '15 at 19:25
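
For reference, the `outer()` idea from the first comment can be combined with the `earth.dist` function from the question directly, since that function is already vectorised. A rough sketch (assuming the two data frames are called `df1` and `df2` with the location_id/LATITUDE/LONGITUDE columns shown in the EDIT; note that the full 7000 x 5000 distance matrix, plus the intermediates created inside `earth.dist`, needs a few GB of working memory):

# Pairwise great-circle distances in km: rows = df1 points, columns = df2 points
dist_mat <- outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
                  function(i, j) earth.dist(df1$LONGITUDE[i], df1$LATITUDE[i],
                                            df2$LONGITUDE[j], df2$LATITUDE[j]))

# Index of the closest df2 point for each df1 point
nearest <- apply(dist_mat, 1, which.min)

result <- data.frame(location_id_1 = df1$location_id,
                     location_id_2 = df2$location_id[nearest],
                     distance_km   = dist_mat[cbind(seq_len(nrow(df1)), nearest)])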

1 Answer


This might help you. It's not the most elegant answer, but for data frames of your size it should do the job fairly well.

require(geosphere)  # distGeo(): geodesic distances on the WGS84 ellipsoid, in metres
require(dplyr)

# Simulated data with the same shape as yours
DB1 <- data.frame(location_id = 1:7000,
                  LATITUDE  = runif(7000, min = -90,  max = 90),
                  LONGITUDE = runif(7000, min = -180, max = 180))
DB2 <- data.frame(location_id = 7001:12000,
                  LATITUDE  = runif(5000, min = -90,  max = 90),
                  LONGITUDE = runif(5000, min = -180, max = 180))

# For a single DB1 location_id, return the nearest DB2 location_id and the distance to it
DistFun <- function(ID) {
  TMP  <- DB1[DB1$location_id == ID, ]
  TMP1 <- distGeo(TMP[, 3:2], DB2[, 3:2])  # columns 3:2 = (LONGITUDE, LATITUDE), the order distGeo expects
  TMP2 <- data.frame(DB1ID = ID,
                     DB2ID = DB2[which.min(TMP1), 1],
                     DistanceBetween = min(TMP1))
  print(ID)  # crude progress indicator
  return(TMP2)
}

# bind_rows() replaces the old rbind_all(), which has since been removed from dplyr
DistanceMatrix <- bind_rows(lapply(DB1$location_id, DistFun))



head(DistanceMatrix)

Source: local data frame [6 x 3]

  DB1ID DB2ID DistanceBetween
1     1  9386        24907.35
2     2 11823       264295.86
3     3  9118        12677.62
4     4 11212       237730.78
5     5 11203        26775.01
6     6  7607        83904.84
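
If you also need the coordinates of each matched pair in the result, a couple of joins back onto the original data frames will add them (a quick sketch using the column names above):

# Attach both sets of coordinates; the suffixes distinguish the DB1 and DB2 columns
DistanceMatrix <- DistanceMatrix %>%
  left_join(DB1, by = c("DB1ID" = "location_id")) %>%
  left_join(DB2, by = c("DB2ID" = "location_id"), suffix = c(".DB1", ".DB2"))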
eclark
  • This is awesome! I'm still messing around with it, but it looks like it excludes a location_id from DB2 after it's been paired to one in DB1. Is there a way to allow this matching process to happen with replacement? – A Trask Aug 05 '15 at 16:55
  • I don't think it does. As you can see, if you run `tail(sort(table(DistanceMatrix$DB2ID)))` you get more than one observation for some location_ids from DB2 – eclark Aug 05 '15 at 21:56