
The file stu.csv contains 850,000 rows and 3 columns. The 2nd column is the longitude of each ID, and the 3rd column is its latitude. The data in stu.csv looks like this:

   ID    longitude    latitude  
  156   41.88367183 12.48777756
  187   41.92854333 12.46903667
  297   41.89106861 12.49270456
  89    41.79317669 12.43212196
  79    41.90027472 12.46274618
  ...       ...         ...

The pseudocode is as follows. It computes the distance between two IDs on the surface of the Earth from their longitude and latitude, and outputs the cumulative sum of the distances:

  dlon = lon2 - lon1
  dlat = lat2 - lat1
  a = (sin(dlat/2))^2 + cos(lat1) * cos(lat2) * (sin(dlon/2))^2
  c = 2 * atan2( sqrt(a), sqrt(1-a) )
  distance = 6371000 * c (where 6371000 is the radius of the Earth in metres)
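
For reference, this formula maps directly to R; the one caveat is that R's sin, cos, and atan2 work in radians, so coordinates stored in degrees need converting first (the code below, like the pseudocode, skips that step). A minimal sketch:

    ## Haversine distance in metres between two points given in degrees.
    haversine <- function(lon1, lat1, lon2, lat2) {
      rad  <- pi / 180                  # degrees -> radians
      lat1 <- lat1 * rad
      lat2 <- lat2 * rad
      dlon <- (lon2 - lon1) * rad
      dlat <- lat2 - lat1
      a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
      2 * 6371000 * atan2(sqrt(a), sqrt(1 - a))
    }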

The code is as follows, but it runs too slowly. How can I speed it up or rewrite it? Thank you.

    stu <- read.table("stu.csv", header = TRUE, sep = ",")

    ## stu has 850,000 rows and 3 columns.
    m <- nrow(stu)

    distance <- 0

    for (i in 1:(m - 1))
    {
      for (j in (i + 1))   # note: j takes only the single value i + 1
      {
        dlon <- stu[j, 2] - stu[i, 2]
        dlat <- stu[j, 3] - stu[i, 3]
        a <- (sin(dlat / 2))^2 + cos(stu[i, 3]) * cos(stu[j, 3]) * (sin(dlon / 2))^2
        c <- 2 * atan2(sqrt(a), sqrt(1 - a))
        distance <- distance + 6371000 * c
      }
    }

    distance
user2405694
    An O(N^2) algorithm is always going to take a long time when N=850000.... – Hong Ooi Apr 15 '16 at 02:20
  • @chinsoon12 Thank you for your reply. I have described the problem. – user2405694 Apr 15 '16 at 02:44
  • besides coding in pure C and still having to wait, you might need to reduce your dimensionality – chinsoon12 Apr 15 '16 at 02:55
  • Just to clarify - you want the sum of the distance between every point and every other point in your df (this is what your code is trying to do now)? I'm thinking you might actually want the cumulative distance from one line to the next line? – jeremycg Apr 15 '16 at 02:58
  • How accurate does the result have to be? If the lat/long of all the points are relatively close to each other (a subjective judgment), you may want a new function that approximates the distance while doing away with the costly trigonometric calculations (i.e. sin, cos), substituting a reasonable constant or some linear calculation that can be vectorised; a sketch of this idea follows below. – Ricky Apr 15 '16 at 03:01
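
As a sketch of the approximation suggested in the last comment above (the helper name is hypothetical, not from the thread), an equirectangular projection does most of the work with a single cosine and no atan2, and tracks the haversine result well when points are close together:

    ## Equirectangular approximation: treat the Earth as locally flat.
    ## Inputs in degrees; output in metres.
    equirect_dist <- function(lon1, lat1, lon2, lat2) {
      rad <- pi / 180
      x <- (lon2 - lon1) * rad * cos((lat1 + lat2) / 2 * rad)
      y <- (lat2 - lat1) * rad
      6371000 * sqrt(x^2 + y^2)
    }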

2 Answers


For your case, if it is the cumulative distance you want, we can vectorize:

x <- read.table(text = "ID    longitude    latitude  
156   41.88367183 12.48777756
187   41.92854333 12.46903667
297   41.89106861 12.49270456
89    41.79317669 12.43212196
79    41.90027472 12.46274618", header= TRUE)


x$laglon <- dplyr::lead(x$longitude, 1)  # longitude of the next row
x$laglat <- dplyr::lead(x$latitude, 1)   # latitude of the next row


distfunc <- function(long, lat, newlong, newlat){
  dlon = newlong - long
  dlat = newlat - lat
  a = (sin(dlat/2))^2 + cos(lat) * cos(newlat) * (sin(dlon/2))^2
  c = 2 * atan2( sqrt(a), sqrt(1-a) )
  6371000 * c 
}

distfunc(x$longitude, x$latitude, x$laglon, x$laglat)
308784.6 281639.6 730496.0 705004.2       NA

Take the cumsum of the output to get the total distance.
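
For example (a sketch; the last element is NA because the final row has no successor):

    d <- distfunc(x$longitude, x$latitude, x$laglon, x$laglat)
    sum(d, na.rm = TRUE)      # total distance
    cumsum(d[!is.na(d)])      # running cumulative distance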

On a million rows, this takes around 0.4 seconds on my system.

jeremycg

What you're looking for is called "vectorizing a loop." See this related question.

The basic idea is that in a loop, the CPU has to finish processing the first cell before moving on to the second, unless it has some guarantee that processing the first cell won't affect the state for the second. In a vector computation that guarantee exists, so it can work on as many elements simultaneously as possible, which is faster. (There are other reasons why this works better, but that's the basic motivation.)

See this introduction to apply in R for how to rewrite your code without the loop. (You should be able to keep much of the calculation as-is.)
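
As a sketch of what that rewrite looks like for the question's loop (keeping the question's formula exactly, including its degrees-as-radians quirk), the double loop collapses to a handful of vectorized operations in base R, with no apply needed:

    ## Pair each row with the next one by dropping the last / first element.
    lon1 <- head(stu[[2]], -1); lon2 <- tail(stu[[2]], -1)
    lat1 <- head(stu[[3]], -1); lat2 <- tail(stu[[3]], -1)
    dlon <- lon2 - lon1
    dlat <- lat2 - lat1
    a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
    distance <- sum(6371000 * 2 * atan2(sqrt(a), sqrt(1 - a)))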

Matthew Gray