
I have two sets of coordinates and am trying to find the closest match for each coordinate. Given that one dataset consists of 1 million records and the other of nearly half a million, I am looking for a faster way to get this task done and would appreciate suggestions.

The dput of the first data set is:

structure(list(longitude = c(-2.5168477762, -2.5972432832, -2.5936692407, 
-2.5943475677, -2.5923214528, -2.5919014869, -2.5913454553, -2.5835739992, 
-2.5673150195, -2.5683356381), latitude = c(51.4844052488, 51.45278562, 
51.4978889752, 51.4979844501, 51.4983813479, 51.4982126232, 51.4964350456, 
51.4123728037, 51.4266239227, 51.4265740193)), .Names = c("longitude", 
"latitude"), row.names = c(NA, 10L), class = "data.frame")

The dput of the second data set is:

structure(list(longitude = c(-3.4385392589, -3.4690321528, -3.2723981534, 
-3.3684012246, -3.329625956, -3.3093349806, 0.8718409198, 0.8718563602, 
0.8643998472, 0.8644153057), latitude = c(51.1931124311, 51.206897181, 
51.1271423704, 51.1618047221, 51.1805971356, 51.1663567178, 52.896084336, 
52.896092955, 52.9496082626, 52.9496168824)), .Names = c("longitude", 
"latitude"), row.names = 426608:426617, class = "data.frame")

I have looked at the approx and findInterval functions in R but did not fully understand how they work. What I am trying to do is take each coordinate from dataset1 and compare it against all the coordinates in dataset2 to find the closest match. Currently I am using two for loops, but it takes forever due to the size of the data.

The code I have tried is given below:

library(geosphere)  # for distm() and distVincentyEllipsoid

cns <- function(x, y) {
  a <- NULL
  b <- NULL
  for (i in 1:nrow(x)) {
    for (j in 1:nrow(y)) {
      a[j] <- distm(c(x$longitude[i], x$latitude[i]),
                    c(y$longitude[j], y$latitude[j]),
                    fun = distVincentyEllipsoid)
    }
    b[i] <- which(a == min(a))
  }
  return(y[b, ])
}

The above function takes one point from dataset1, calculates its distance to every point in dataset2, finds the minimum distance, and returns the coordinates of the closest point.

Perhaps parallel processing could achieve this task in a reasonable time. Any suggestions welcome.

Regards,

syebill
  • A good way to reduce computational time is to vectorize instead of using for loops. Do you know if distm can take vectors as input? If you want to parallelize, given your code, I suggest looking at foreach from foreach package. – DeveauP Mar 31 '16 at 11:04
  • Yes. Looking at foreach, and "distm" also takes vectors. Do you recommend using vectors to perform the same task, and can this method be used to get the same result? – syebill Mar 31 '16 at 11:12

1 Answer


1. Try to vectorize your code

Vectorized code is often more efficient in R than for loops:

cns2 <- function(x, y) {
  b <- numeric(nrow(x))  # preallocate: one index per row of x
  for (i in 1:nrow(x)) {
    a <- distm(x = x[i, ],
               y = y,
               fun = distVincentyEllipsoid)
    b[i] <- which.min(a)
  }
  return(y[b, ])
}

Let's assess the difference:

library(microbenchmark)
microbenchmark(cns(x, y),  # where x is your first data frame, y the second
               cns2(x, y)
               )

Results:

  Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
  cns(x, y) 42.46518 45.16829 46.61517 46.45560 47.09023 80.25171   100
 cns2(x, y) 26.09484 27.33122 28.21505 28.07837 29.10225 30.74004   100

You have already cut the time in half without parallel computing. Can we reduce it even more?

cns3 <- function(x, y) {
  a <- distm(x = x,
             y = y,
             fun = distVincentyEllipsoid)
  b <- apply(X = a, MARGIN = 1, FUN = which.min)
  return(y[b, ])
}

Benchmark returns:

    Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
  cns(x, y) 43.38928 45.69135 48.72223 46.70839 48.56951 135.80555   100
 cns2(x, y) 25.96674 27.15066 28.86999 28.43569 29.99138  35.86383   100
 cns3(x, y) 23.90187 24.84592 26.68738 25.87950 27.99075  34.71469   100

So cns3 seems to be a little faster, but cns2 can easily be parallelized by replacing for with foreach.
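As a sketch of that parallelization (assuming the foreach and doParallel packages are installed; cns2_par and the cores argument are names chosen here for illustration, not from the original code):

```r
library(geosphere)
library(foreach)
library(doParallel)

cns2_par <- function(x, y, cores = 2) {
  cl <- makeCluster(cores)
  registerDoParallel(cl)
  # each iteration handles one row of x; results are combined with c()
  b <- foreach(i = 1:nrow(x), .combine = c, .packages = "geosphere") %dopar% {
    which.min(distm(x = x[i, ], y = y, fun = distVincentyEllipsoid))
  }
  stopCluster(cl)
  y[b, ]
}
```

Whether this actually beats cns3 depends on the per-iteration cost versus the cluster overhead; for very large x it is worth benchmarking both.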

Is it correct? All three methods give the same output.

> cns(x,y)
         longitude latitude
426613   -3.309335 51.16636
426613.1 -3.309335 51.16636
426613.2 -3.309335 51.16636
426613.3 -3.309335 51.16636
426613.4 -3.309335 51.16636
426613.5 -3.309335 51.16636
426613.6 -3.309335 51.16636
426613.7 -3.309335 51.16636
426613.8 -3.309335 51.16636
426613.9 -3.309335 51.16636
> cns2(x,y)
         longitude latitude
426613   -3.309335 51.16636
426613.1 -3.309335 51.16636
426613.2 -3.309335 51.16636
426613.3 -3.309335 51.16636
426613.4 -3.309335 51.16636
426613.5 -3.309335 51.16636
426613.6 -3.309335 51.16636
426613.7 -3.309335 51.16636
426613.8 -3.309335 51.16636
426613.9 -3.309335 51.16636
> cns3(x,y)
         longitude latitude
426613   -3.309335 51.16636
426613.1 -3.309335 51.16636
426613.2 -3.309335 51.16636
426613.3 -3.309335 51.16636
426613.4 -3.309335 51.16636
426613.5 -3.309335 51.16636
426613.6 -3.309335 51.16636
426613.7 -3.309335 51.16636
426613.8 -3.309335 51.16636
426613.9 -3.309335 51.16636

2. As often when asked for minimal values: what do you want to do when there are ties?

The way you wrote it, you keep all ties, which may cause trouble because b may be coerced to a list at some point.
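A small example of the difference (not from the original post): which() on the minimum keeps every tied index, while which.min() returns only the first.

```r
a <- c(3, 1, 2, 1)
which(a == min(a))  # returns 2 4 -- both tied minima
which.min(a)        # returns 2   -- first minimum only
```

This is why assigning `which(a == min(a))` to `b[i]` can silently drop or spread values when ties occur, whereas which.min always yields a single index.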

DeveauP
  • Thanks. When it comes to comparing 1 million rows with half a million, this time difference can be a deciding factor. You mentioned "ties"; are you referring to duplicated values in x, y, or both? Also, can parallel be used with cns3 to further reduce the computation time? – syebill Mar 31 '16 at 11:46
  • 1
    Good answer! (+1) Regarding `cns2`, the vector `b` isn't preallocated even though its dimensions seem to be clear a priori. Regarding `cns3` (purely cosmetic), `b` doesn't need to be defined beforehand. My guess is that the small difference between your procedures is due to preallocation. – SimonG Mar 31 '16 at 11:46
  • Is there a way where we can extract a subset of approx. values from y before calculating the distance i.e. using approx or findInterval type of function or may be some sorting algorithm? e.g. let's suppose x is 1.134 and y values are 1.567, 1.123, 2.345, 1.115. So instead of calculating the distance for all 4 values of y, we can somehow calculate the distance for the two values? Any ideas – syebill Mar 31 '16 at 11:51
  • @SimonG: I changed the allocation and reran the microbenchmark, still cns3 seems better. You're correct in pointing out the cosmetic, for answers it is important :) syebill: cns2 will be very easy to transform with foreach (see foreach documentation to start and stop a cluster). I don't know if ordering would help honestly, as I don't know how distm works. – DeveauP Mar 31 '16 at 11:59
  • 1
    An alternative `cns4 <- function(x,y){ a <- distm(x, y,fun = distVincentyEllipsoid) b <- sapply(1:nrow(a), function(i) which.min(a[i,])) return(y[b,]) } ` – G. Cocca Mar 31 '16 at 19:53
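On the prefiltering idea raised in the comments, one possible sketch (cns_prefilter and its window argument are hypothetical names, and the result is only approximate: a window that is too small can miss the true nearest neighbour, hence the fallback to a full search):

```r
library(geosphere)

cns_prefilter <- function(x, y, window = 0.5) {  # window in degrees
  b <- numeric(nrow(x))
  for (i in 1:nrow(x)) {
    # restrict y to a small lon/lat box around the current x point
    near <- which(abs(y$longitude - x$longitude[i]) < window &
                  abs(y$latitude  - x$latitude[i])  < window)
    if (length(near) == 0) near <- 1:nrow(y)  # fall back to full search
    d <- distm(x[i, ], y[near, ], fun = distVincentyEllipsoid)
    b[i] <- near[which.min(d)]
  }
  y[b, ]
}
```

Computing the exact ellipsoidal distance only for the candidates inside the box can shrink the inner comparison from half a million points to a handful, at the cost of choosing a sensible window.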