1

For this example, I have one table which includes a list of people, a group category, and the location for each individual person (Long/Lat coordinates). A single individual can be in multiple groups. Here is an example table:

Person  Group   Long     Lat
1       1       11       23
2       1       12       24
.       .       .        .
.       .       .        .
.       .       .        .
2       2       12       24

I have another table which lists Businesses, their locations, and a shared group that matches the groupings in the first table. Again, a Business can be in multiple groups. Example table:

Busns   Group   Long     Lat
5       1       5        6
6       1       6        7
.       .       .        .
.       .       .        .
.       .       .        .
5       2       5        6

I want to, by Person and by Group, match the Business with minimum distance between the two. This is proving to be a very memory-intensive task as I have it. Currently I create an enormous table through a RIGHT JOIN, which then measures the distance between a person and a business for every group. then I create another that finds the minimum distance for each person in a group, and then I do an INNER JOIN in order to pair the original table down. Example code:

DROP TABLE IF EXISTS DistancePairs;
CREATE LOCAL TEMPORARY TABLE DistancePairs ON COMMIT PRESERVE ROWS AS (
SELECT  a.Person
        ,a.Group
        ,b.Business
        ,a.Latitude AS PersonLat
        ,a.Longitude AS PersonLong
        ,b.Latitude AS BusinessLat
        ,b.Longitude AS BusinessLong
        ,0.621371*DISTANCEV(a.Latitude,a.Longitude,b.Latitude,b.Longitude) AS AproxDistance
FROM people a
RIGHT JOIN business b
ON a.Group = b.Group
);

DROP TABLE IF EXISTS MinDist;
CREATE LOCAL TEMPORARY TABLE MinDist ON COMMIT PRESERVE ROWS AS (
SELECT DISTINCT
    Person
    ,Group
    ,MIN(AproxDistance) AS AproxDistance
FROM Distance Pairs
);

SELECT  a.Person
        ,a.Group
        ,a.Business
        ,a.AproxDistance
FROM DistancePairs a
JOIN MindDist b
ON a.Person = b.Person
AND a.Group = b.Group
AND a.AproxDistance = b.AproxDistance
;

Is there a better way to do this? This performs terribly and runs for hours given the size of the data set I'm using. The original Person and Business tables have already been created using WHERE statements to limit their size.

Analyst Guy
  • 115
  • 13
  • At the risk of making it way more complicated, maybe a hierarchical approach would help where every location is first assigned to a "quadrant" (say like 1000 of them, equal lat/long bins or maybe k-means clustering). Then approximate quadrant boundaries are calculated using a simple lat/long rollup. Then min/max distance between each quadrant is calculated from those. Then for point to point start from nearest quadrants and stop early. Take with grains of salt as I'm not geography experienced and there are some gotchas on which min/max to use to avoid the approximations. – crokusek Jul 27 '18 at 22:19
  • I guess the first thing you did was to analyze the explain plan of your Q. Can you please share it? – mauro Jul 28 '18 at 16:46

1 Answers1

1

Could you try formulating the query with a join, followed by an analytic LIMIT clause?

I only have your little bit of example data, so I can't really test it for sense or nonsense. But here goes:

WITH
-- this is your input data ...
persons        ( Person, grp,  Long,    Lat ) AS (
          SELECT 1   ,   1   ,   11  ,    23
UNION ALL SELECT 2   ,   1   ,   12  ,    24
UNION ALL SELECT 2   ,   2   ,   12  ,    24
)
,
-- and this, is also your input data ....
businesses     (Busns,  grp,  Long,    Lat) AS (
          SELECT 5   ,   1   ,   5  ,     6
UNION ALL SELECT 6   ,   1   ,   6  ,     7
UNION ALL SELECT 5   ,   2   ,   5  ,     6
)
,
-- real WITH clause would start here ....
join_and_calc AS (
SELECT
  person
, p.grp
, busns
, p.lat
, p.long
, b.lat
, b.long
, 0.621371 * DISTANCEV(p.lat,p.long,b.lat,b.long) AS app_dist
FROM persons    p
JOIN businesses b USING(grp)
)
SELECT
  *
FROM join_and_calc
LIMIT 1 OVER(PARTITION BY person,grp,busns ORDER BY app_dist)
;

The result I get is:

 person | grp | busns | lat | long | lat | long |     app_dist     
--------+-----+-------+-----+------+-----+------+------------------
      1 |   1 |     5 |  23 |   11 |   6 |    5 | 1235.42458453758
      1 |   1 |     6 |  23 |   11 |   7 |    6 | 1149.36524763703
      2 |   1 |     5 |  24 |   12 |   6 |    5 | 1322.28298287477
      2 |   1 |     6 |  24 |   12 |   7 |    6 | 1234.90557929051
      2 |   2 |     5 |  24 |   12 |   6 |    5 | 1322.28298287477

Good luck - Marco

marcothesane
  • 6,192
  • 1
  • 11
  • 21