
I have a dataset of 10,000 observations with location information that looks like this:

    ADDRESS                | CITY          | STATE | ZIP   | LATITUDE     | LONGITUDE
    1189 Beall Ave         | Wooster       | OH    | 44691 | 40.8110501   | -81.93361870000001
    580 West 113th Street  | New York City | NY    | 10025 | 40.8059768   | -73.96506139999997
    268 West Putnam Avenue | Greenwich     | CT    | 06830 | 40.81776801  | -73.96324589997
    1 University Drive     | Orange        | CA    | 92866 | 40.843766801 | -73.9447589997
    200 South Pointe Drive | Miami Beach   | FL    | 33139 | 40.1234801   | -73.966427997

I need to find the overlapping locations within a 5-mile and a 10-mile radius. I heard that there is a function called geodist which may allow me to do that, although I have never used it. The problem is that for geodist to work I may need every combination of latitudes and longitudes to be side by side, which would make the file very large and hard to use. I also do not know how I would get the lat/longs for every combination side by side.
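To make the pairwise idea concrete, here is a rough sketch of the brute-force version in Python rather than SAS (the haversine formula below is a stand-in for geodist, and the sample rows are taken from the table above; this is an illustration, not the asker's code):

```python
from itertools import combinations
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, roughly what geodist with the 'M' option returns."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959.0 * asin(sqrt(a))  # 3959 mi = mean Earth radius

# Sample rows from the table above: (address, latitude, longitude)
points = [
    ("1189 Beall Ave",         40.8110501,  -81.93361870000001),
    ("580 West 113th Street",  40.8059768,  -73.96506139999997),
    ("268 West Putnam Avenue", 40.81776801, -73.96324589997),
]

# Every unordered pair exactly once: n*(n-1)/2 distance calculations,
# which is the quadratic blow-up the question is worried about.
for (name1, lat1, lon1), (name2, lat2, lon2) in combinations(points, 2):
    d = haversine_miles(lat1, lon1, lat2, lon2)
    if d <= 5:
        print(f"{name1} <-> {name2}: {d:.2f} miles")
```

With 10,000 observations this is about 50 million distance calculations, which is why the answers below look for ways to avoid comparing every pair.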

Does anyone know of a way I could get the final output that I am looking for?

V_N
  • To get the best answers, try the simple way first, and show your work. To paraphrase an aphorism, 100M records isn't what it used to be, and SAS should scale easily to that range. If for some reason you have a limited environment, (e.g., you are interfacing with Excel), then there are many sensible ways to pre-process the data. People will be glad to reward your effort with a large collection of approaches. – Leo Jan 31 '17 at 04:10

1 Answer


Here is a broad outline of one possible approach to this problem:

  • Allocate each address into a latitude and longitude 'grid' by rounding the co-ordinates to the nearest 0.01 degrees or something like that.
  • Within each cell, number all the addresses 1 to n so that each has a unique id.
  • Write a datastep taking your address dataset as input via a set statement, and also load it into a hash object. Your dataset is fairly small, so you should have no problems fitting the relevant bits in memory.
  • For each address, calculate distances only to other addresses in the same cell, or other cells within a certain radius, i.e.
    1. Decide which cell to look up
    2. Iterate through all the addresses in that cell using the unique id you created earlier, looking up the co-ordinates of each from the hash object
    3. Use geodist to calculate the distance for each one and output a record if it's a keeper.
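The outline above can be sketched as follows (in Python rather than SAS, with a haversine function standing in for geodist; the 0.2-degree cell size and one-cell neighbour window are illustrative assumptions, not part of the original answer):

```python
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt

# Illustrative cell size: 0.2 degrees of latitude is ~13.8 miles, and 0.2
# degrees of longitude is ~10.6 miles at 40N, so a cell plus its eight
# neighbours covers a 5- or 10-mile radius at mid-US latitudes.
# (Further north you would need a wider cell or neighbour window.)
CELL = 0.2

def haversine_miles(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959.0 * asin(sqrt(a))

def cell_of(lat, lon):
    """Step 1: allocate a point to a grid cell by truncating its coordinates."""
    return (int(lat // CELL), int(lon // CELL))

def pairs_within(points, radius_miles):
    """Return index pairs (i, j), i < j, whose points lie within radius_miles."""
    # Step 2: bucket every point by cell (playing the hash-object role).
    grid = defaultdict(list)
    for i, (lat, lon) in enumerate(points):
        grid[cell_of(lat, lon)].append(i)
    # Steps 3-4: for each point, compute distances only to points in the
    # same cell or an adjacent cell, never to the whole dataset.
    out = []
    for i, (lat, lon) in enumerate(points):
        ci, cj = cell_of(lat, lon)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for j in grid.get((ci + di, cj + dj), ()):
                    if j > i and haversine_miles(lat, lon, *points[j]) <= radius_miles:
                        out.append((i, j))
    return out

# Coordinates from the question's table
points = [
    (40.8110501,   -81.93361870000001),  # Wooster, OH
    (40.8059768,   -73.96506139999997),  # New York City, NY
    (40.81776801,  -73.96324589997),     # Greenwich, CT
    (40.843766801, -73.9447589997),      # Orange, CA
    (40.1234801,   -73.966427997),       # Miami Beach, FL
]
print(pairs_within(points, 5))  # prints [(1, 2), (1, 3), (2, 3)]
```

Only the three points whose coordinates cluster around 40.8N/-73.96W end up being compared against each other; Wooster and the last row fall in distant cells and are never touched.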

This is a bit more work to program, but it is much more efficient than an O(n^2) brute-force search. I once used a similar algorithm with a dataset of 1.8m UK postcodes and about 60m points of co-ordinate data.

user667489