I have 2 sets of nodes - Set A and Set B. Each set is of size 25,000.

I am given a percentage (let's say 20%). I need to find the minimum distance such that 20% of the nodes in Set A are within that distance of any node in Set B.

Solution:

For each node in Set A, find the distance to its nearest node in Set B, then take the 20% of Set A with the smallest such distances. The answer is the node in that 20% which is the farthest from its nearest node in Set B.

Brute Force Solution:

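        // For every node in A, record the distance to its nearest node in B.
        foreach (Node a in setA)
        {
            a.ShortestDistance = double.PositiveInfinity;
            foreach (Node b in setB)
            {
                double d = a.DistanceTo(b); // compute each distance only once
                if (d < a.ShortestDistance)
                {
                    a.ShortestDistance = d;
                }
            }
        }
        setA.SortByShortestDistance(); // ascending
        // 0-based index of the last node inside the closest 20% (needs System.Math).
        int k = Math.Max(0, (int)Math.Ceiling(setA.Size * 0.2) - 1);
        return setA[k]; // its ShortestDistance is the answer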

This works, but the time it would take is insane: O(n^2) for the nested loops plus O(n log n) for the sort, so O(n^2) overall, I think.

How can I speed this up? I would like to hit O(n) if possible.

Evorlor

2 Answers

The following algorithm might improve the speed:

  1. Convert your (lat, long) pairs to (x, y, z) Cartesian coordinates with the centre of the Earth as the origin.
  2. The straight-line distance between two (x, y, z) points is a lower bound on the actual distance in spherical coordinates. On a sphere it also grows with that distance, so the nearest neighbour is the same under both measures.
  3. Construct two separate 3D trees, one for setA and one for setB.
  4. For each node a in setA, search for its nearest neighbour in the 3D tree of setB, which is O(log N) in the average case.
  5. The shortest distance for a is then its distance to that nearest neighbour.
  6. Then sort setA as you have done.
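The steps above could be sketched roughly as follows, in Python rather than the question's language. This is a minimal illustration, not a tuned implementation: the function names, the nested-tuple tree layout, and the 6371 km mean Earth radius are all assumptions made for the example, and only the tree over setB is queried here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, for illustration only

def to_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def chord_to_arc_km(chord):
    """Turn a unit-sphere chord length back into a great-circle distance."""
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, chord / 2.0))

def build_kdtree(points, depth=0):
    """Build a 3-d tree as nested tuples (point, left, right); None if empty."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest_chord(node, target, depth=0, best=math.inf):
    """Return the smallest chord length from target to any point in the tree."""
    if node is None:
        return best
    point, left, right = node
    best = min(best, math.dist(point, target))
    axis = depth % 3
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest_chord(near, target, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best so far.
    if abs(diff) < best:
        best = nearest_chord(far, target, depth + 1, best)
    return best

def nearest_distances_km(set_a, set_b):
    """Sorted great-circle distance from each (lat, lon) in set_a to set_b."""
    tree = build_kdtree([to_xyz(lat, lon) for lat, lon in set_b])
    return sorted(chord_to_arc_km(nearest_chord(tree, to_xyz(lat, lon)))
                  for lat, lon in set_a)
```

Because each chord length is converted back to an arc length at the end, the returned values are great-circle distances on the assumed sphere rather than straight-line approximations. The entry at the index covering the requested percentage of the sorted result would be the distance asked for.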

Time complexity:

Average case: O(n log n)

Worst case: O(n^2)

Vikram Bhat
  • I like the idea, but I'm unsure whether the increase in speed is worth the loss in accuracy. Something I will have to consider. – Evorlor Jun 25 '14 at 05:54

You could pick the smaller of the two sets and build a structure from it for answering nearest-neighbour queries. A cover tree (http://en.wikipedia.org/wiki/Cover_tree) doesn't make many assumptions about the underlying metric, so it should work with haversine/great-circle distance.

After doing this, the simplest thing to do would be to take every member of the larger set, find the nearest neighbour to it in the smaller set, and then either sort the distances or run Quickselect (http://en.wikipedia.org/wiki/Quickselect) on them. If you had a rough idea of the final distance, you might save some time by modifying the find operation to return early, without finding anything, whenever the nearest object must be further away than a threshold distance.
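As a rough Python illustration of the selection step, the sketch below picks the k-th smallest nearest-neighbour distance without a full sort. The names `quickselect` and `distance_covering` are made up for this example, and it assumes the nearest-neighbour distances have already been collected into a list.

```python
import math
import random

def quickselect(values, k):
    """Return the k-th smallest value (0-based) in expected linear time."""
    values = list(values)  # work on a copy so the caller's list is untouched
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    while True:
        pivot = random.choice(values)
        smaller = [v for v in values if v < pivot]
        equal_count = sum(1 for v in values if v == pivot)
        if k < len(smaller):
            values = smaller
        elif k < len(smaller) + equal_count:
            return pivot
        else:
            k -= len(smaller) + equal_count
            values = [v for v in values if v > pivot]

def distance_covering(nearest_distances, fraction):
    """Smallest distance d such that `fraction` of the nodes lie within d."""
    n = len(nearest_distances)
    k = max(1, math.ceil(fraction * n))  # number of nodes that must be covered
    return quickselect(nearest_distances, k - 1)
```

For example, `distance_covering([5.0, 1.0, 4.0, 2.0, 3.0, 6.0, 8.0, 7.0], 0.5)` needs four of the eight nodes covered and so returns the fourth smallest value, 4.0. With 25,000 distances the saving over a sort is likely small next to the nearest-neighbour queries themselves.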

You could get a rough idea by performing the same operation on a random sample from the two sets beforehand. If your guess is a bit too high, you just have a few more nearest neighbour distances to sort. If your guess is a bit too low, you only need to repeat the find operations for those points where the nearest neighbour operation returned early without finding anything.
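The sample-then-retry idea might look something like the Python sketch below. Everything named here is hypothetical: `bounded_nearest` is a brute-force stand-in for a tree query that gives up beyond `limit` (a real tree would prune and actually save time), `dist` is whatever distance function you use, and doubling the limit on a miss is just one possible retry policy. Both sets are assumed to be non-empty lists.

```python
import math
import random

def bounded_nearest(a, set_b, limit, dist):
    """Stand-in for a tree query: nearest distance, or None if beyond limit."""
    best = min(dist(a, b) for b in set_b)
    return best if best <= limit else None

def covering_distance(set_a, set_b, fraction, dist, sample_size=100, seed=0):
    """Distance within which `fraction` of set_a has a neighbour in set_b."""
    rng = random.Random(seed)
    k = max(1, math.ceil(fraction * len(set_a)))  # nodes that must be covered

    # Rough guess: solve the same problem on a random sample of both sets.
    sample_a = rng.sample(set_a, min(sample_size, len(set_a)))
    sample_b = rng.sample(set_b, min(sample_size, len(set_b)))
    sample_d = sorted(min(dist(a, b) for b in sample_b) for a in sample_a)
    limit = sample_d[min(len(sample_d) - 1, math.ceil(fraction * len(sample_d)))]

    found, pending = [], list(set_a)
    while True:
        still_pending = []
        for a in pending:
            d = bounded_nearest(a, set_b, limit, dist)
            if d is None:
                still_pending.append(a)  # retry later with a larger limit
            else:
                found.append(d)
        # Every found distance is <= limit and every pending one is > limit,
        # so once k distances are found the k smallest overall are among them.
        if len(found) >= k:
            return sorted(found)[k - 1]
        pending = still_pending
        limit = limit * 2 if limit > 0 else math.inf
```

A call such as `covering_distance(points_a, points_b, 0.2, my_distance)` (with hypothetical inputs) only repeats queries for the points that returned early, which is the saving described above.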

mcdowella