0

I have 2 tables

USER contains a UserID and 2 zip codes (around 1 million records)

+--------+----------+----------+
| UserID | Zipcode1 | Zipcode2 |
+--------+----------+----------+
|      1 |    08003 |    10016 |
|      2 |    11780 |    48073 |
|      3 |    57106 |    33487 |
+--------+----------+----------+

LOCATION contains a LocationID and a zip code (around 1000 records)

+------------+---------+
| LocationID | Zipcode |
+------------+---------+
|          1 |   33004 |
|          2 |   96818 |
|          3 |   08816 |
+------------+---------+

I have a function that takes the zipcodes, joins to a table with latitude/longitude, calculates the distance between user zips and the location zip, and returns the shorter distance.

dbo.fnMinZipDistance(Location.Zipcode, User.Zipcode1, User.Zipcode2)

example from user 1 & location 1:
dbo.fnMinZipDistance('33004', '08003', '10016') returns 995.383 
because the distance from 33004 to 10016 is 995.383 
and the distance from 33004 to 08003 is 1067.802

For each UserID, I need the 3 LocationIDs with the shortest distance to either user zipcode.

My initial avenue of attack was to get the distance for each location, partition by user ordered by distance, and select where the row number is < 4

WITH UserLocations as 
(
    SELECT
    U.UserID, 
    L.LocationID, 
    rowNum = row_number() Over(partition by U.UserID ORDER BY dbo.fnMinZipDistance(L.Zipcode, U.Zipcode1, U.Zipcode2))
    FROM
    USERS U
    JOIN LOCATIONS L ON 1 = 1
)
SELECT * FROM UserLocations WHERE rowNum < 4

This is taking multiple days to run, since before I can get the 3 closest, I must get all distances which total around 1 billion records.

My next thought was to precalculate every possible zip code distance in a table to join to, but that would be around 1.8 billion possible combinations (roughly 43000 active US zip codes), and I'm not sure how much that will help. I'm running it for comparison now, and I am at 6 hours on the attempt.

I need to drastically reduce the run time for this query.

Any suggestions would be extremely appreciated.

Broom
  • 596
  • 2
  • 18

1 Answers1

0

Precalculating the distance from each zip code to each location zip code is probably the best way to go.

Only dealing with 44,000 zip codes is already a big win from dealing with the 2 million in your original data. You can just keep, say, the five closest zip codes in the summary table.

You can also simplify the search. I think you can assume that the closest zip codes are within some distance -- say 100 miles -- of a given zip code. This would allow you to put ranges on latitude and longitude for the search.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786