Algorithm to discard irrelevant latitude longitude pairs

Question

I am trying to compute the best latitude/longitude pair for each of several locations. I have a database of locations, and each location may have multiple coordinates. Most of these coordinates look correct for the location, as they lie within 5 meters of each other, so I can derive a new (final) latitude/longitude pair by averaging them.

Sometimes, however, I have a point (sometimes more than one) that is located several hundred meters away.

Given a small set (maximum 10) of latitude/longitude points, I would like to find and keep only those points that make sense and discard those that are too far away from the others.

What approach or algorithm should I use?

Note: I work with Java.

More information at http://stackoverflow.com/questions/18805178/how-to-detect-outliers-in-an-arraylist — gknicker, Dec 28 '14 at 22:40

Mshnik · Answer 1 · 2014-12-28T22:41:52.717

The simplest approach is likely to be:

1. Find the centroid (average latitude/longitude) of the given set of points.
2. Compute the distance from each point in the set to the centroid. Discard all points whose distance exceeds a certain constant value (call these points noise).
3. Recompute the centroid from the remaining non-noise points and call that the location.

This should be pretty simple to implement in Java and can certainly be done in O(N), N being the number of points in your set.
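The three steps above can be sketched as follows. This is only a minimal illustration: the class name `CentroidFilter`, the `{lat, lon}` array representation, the equirectangular distance approximation (reasonable for points a few hundred meters apart), and the threshold parameter are all assumptions, not something given in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; points are {latitude, longitude} in degrees.
final class CentroidFilter {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    /** Approximate distance in meters (equirectangular projection, fine for short distances). */
    static double distanceMeters(double[] a, double[] b) {
        double meanLat = Math.toRadians((a[0] + b[0]) / 2.0);
        double dLat = Math.toRadians(b[0] - a[0]);
        double dLon = Math.toRadians(b[1] - a[1]) * Math.cos(meanLat);
        return EARTH_RADIUS_M * Math.sqrt(dLat * dLat + dLon * dLon);
    }

    /** Plain average of latitudes and longitudes; assumes a non-empty list. */
    static double[] centroid(List<double[]> points) {
        double lat = 0, lon = 0;
        for (double[] p : points) {
            lat += p[0];
            lon += p[1];
        }
        return new double[] { lat / points.size(), lon / points.size() };
    }

    /** Step 1: centroid. Step 2: drop points beyond the threshold. Step 3: recompute. */
    static double[] robustLocation(List<double[]> points, double maxDistanceMeters) {
        double[] first = centroid(points);
        List<double[]> kept = new ArrayList<>();
        for (double[] p : points) {
            if (distanceMeters(p, first) <= maxDistanceMeters) {
                kept.add(p);
            }
        }
        // If the threshold rejected everything, fall back to the first centroid.
        return kept.isEmpty() ? first : centroid(kept);
    }
}
```

Note that the outlier itself pulls the first centroid away from the good points, so the threshold has to be larger than that shift; with one point several hundred meters away among a handful of good ones, a value around 200 meters would be a plausible starting point.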

Your problem is a specific case of K-means clustering, in that you know which real-world data correspond to which samples, whereas in the general case you don't have that knowledge. Look into that problem and related approaches if you want to research further.

score 1 · Accepted Answer · answered Dec 28 '14 at 22:55

Simple approach:

1. Compute the distance of all points to some arbitrary point.
2. Find the median distance of all points.
3. Discard all points for which abs(dist - median) > value.

This is a bit better than the centroid approach, which can be thrown off by a few far-away points that are clustered together.
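A rough sketch of the median-based filter might look like this. The class name `MedianDistanceFilter`, the choice of passing the reference point as a parameter, the `{lat, lon}` array representation, and the equirectangular distance helper are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; points are {latitude, longitude} in degrees.
final class MedianDistanceFilter {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    /** Approximate distance in meters (equirectangular projection, fine for short distances). */
    static double distanceMeters(double[] a, double[] b) {
        double meanLat = Math.toRadians((a[0] + b[0]) / 2.0);
        double dLat = Math.toRadians(b[0] - a[0]);
        double dLon = Math.toRadians(b[1] - a[1]) * Math.cos(meanLat);
        return EARTH_RADIUS_M * Math.sqrt(dLat * dLat + dLon * dLon);
    }

    /** Keeps the points whose distance to the reference is within tolerance of the median distance. */
    static List<double[]> keepNearMedian(List<double[]> points, double[] reference, double toleranceMeters) {
        List<double[]> kept = new ArrayList<>();
        if (points.isEmpty()) {
            return kept;
        }
        double[] dist = new double[points.size()];
        for (int i = 0; i < dist.length; i++) {
            dist[i] = distanceMeters(points.get(i), reference);
        }
        // Median of the distances: middle value, or mean of the two middle values.
        double[] sorted = dist.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        for (int i = 0; i < dist.length; i++) {
            if (Math.abs(dist[i] - median) <= toleranceMeters) {
                kept.add(points.get(i));
            }
        }
        return kept;
    }
}
```

One caveat of comparing distances to a single reference point: two points at the same distance from it but in different directions look identical to this filter, so the choice of reference matters.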

Michal Lozinski · Answer 3 · 2014-12-28T22:43:00.527

There are a couple of questions you need to ask yourself:

1. Which point should be treated as "not making sense" if you have only two points that are 100 meters apart?
2. Which points should be treated as "not making sense" if you have two separate clusters of points?
3. What should you do if you have a continuous row of points that each fit within the margin of error measured to the closest neighbour, but in total span more than the limit?

The question you've asked is hard to answer without clear criteria, although I'd try looking through clustering algorithms.

If we set aside the problems I've mentioned, one approach, although computationally heavy, is:

1. Calculate the distances between all pairs of points in the given set.
2. Sort the points by the sum of their distances to all other points.
3. Filter out the point with the highest sum.
4. Repeat until there are no points for which the sum of distances is greater than errorMargin * (N - 1), where N is the current number of points.

You still need to take the border cases into consideration, because, for instance, the problem mentioned in 1) would leave you with a single random point. I doubt you're OK with that, so you need to analyse your domain carefully.
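As an illustration, the iterative removal described above could be sketched like this. The class name `DistanceSumFilter`, the `{lat, lon}` array representation, and the equirectangular distance helper are assumptions; instead of sorting, the sketch simply looks for the point with the highest sum on each pass, which gives the same point to remove.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; points are {latitude, longitude} in degrees.
final class DistanceSumFilter {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    /** Approximate distance in meters (equirectangular projection, fine for short distances). */
    static double distanceMeters(double[] a, double[] b) {
        double meanLat = Math.toRadians((a[0] + b[0]) / 2.0);
        double dLat = Math.toRadians(b[0] - a[0]);
        double dLon = Math.toRadians(b[1] - a[1]) * Math.cos(meanLat);
        return EARTH_RADIUS_M * Math.sqrt(dLat * dLat + dLon * dLon);
    }

    /** Repeatedly drops the point with the largest sum of distances to the others. */
    static List<double[]> prune(List<double[]> points, double errorMarginMeters) {
        List<double[]> remaining = new ArrayList<>(points);
        while (remaining.size() > 1) {
            int worstIndex = -1;
            double worstSum = -1.0;
            for (int i = 0; i < remaining.size(); i++) {
                double sum = 0.0;
                for (int j = 0; j < remaining.size(); j++) {
                    sum += distanceMeters(remaining.get(i), remaining.get(j));
                }
                if (sum > worstSum) {
                    worstSum = sum;
                    worstIndex = i;
                }
            }
            // Stop once even the worst point is within errorMargin * (N - 1).
            if (worstSum <= errorMarginMeters * (remaining.size() - 1)) {
                break;
            }
            remaining.remove(worstIndex); // removes by index
        }
        return remaining;
    }
}
```

Each pass is O(N²), which is negligible for at most 10 points. With only two points farther apart than the margin, this sketch keeps one of them arbitrarily, which is exactly the border case raised in question 1.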

score 0 · Answer 4 · answered Dec 28 '14 at 23:21

If you are using Java 8, then the following code provides an elegant solution.

Collector<Location, ?, Location> centreCollector = new CentreCollector();
Location centre = locations.stream().collect(centreCollector);
// A method reference cannot take arguments, so call furtherThan directly; negate() keeps the near points.
centre = locations.stream().filter(centre.furtherThan(NOISE_DISTANCE).negate()).collect(centreCollector);

You have two things to create: the CentreCollector class, which implements Collector and averages Location objects as they are streamed to it, and the furtherThan method, which returns a Predicate that compares the distance between this and a given location to a given distance.
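Since neither piece is shown in the answer, here is one possible shape for them. Everything below is an assumption for illustration: the fields of `Location`, the `distanceTo` helper with its equirectangular approximation, and the use of a three-element `double[]` (sum of latitudes, sum of longitudes, count) as the collector's accumulator.

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collector;

// Hypothetical value class; latitude and longitude in degrees.
final class Location {
    static final double EARTH_RADIUS_M = 6_371_000.0;
    final double lat;
    final double lon;

    Location(double lat, double lon) {
        this.lat = lat;
        this.lon = lon;
    }

    /** Approximate distance in meters (equirectangular projection, fine for short distances). */
    double distanceTo(Location other) {
        double meanLat = Math.toRadians((lat + other.lat) / 2.0);
        double dLat = Math.toRadians(other.lat - lat);
        double dLon = Math.toRadians(other.lon - lon) * Math.cos(meanLat);
        return EARTH_RADIUS_M * Math.sqrt(dLat * dLat + dLon * dLon);
    }

    /** Predicate that is true for locations further than maxMeters from this one. */
    Predicate<Location> furtherThan(double maxMeters) {
        return other -> distanceTo(other) > maxMeters;
    }
}

// Averages the streamed locations; accumulator is {sumLat, sumLon, count}.
final class CentreCollector implements Collector<Location, double[], Location> {
    @Override
    public Supplier<double[]> supplier() {
        return () -> new double[3];
    }

    @Override
    public BiConsumer<double[], Location> accumulator() {
        return (acc, loc) -> {
            acc[0] += loc.lat;
            acc[1] += loc.lon;
            acc[2] += 1;
        };
    }

    @Override
    public BinaryOperator<double[]> combiner() {
        return (a, b) -> {
            a[0] += b[0];
            a[1] += b[1];
            a[2] += b[2];
            return a;
        };
    }

    @Override
    public Function<double[], Location> finisher() {
        // An empty stream yields NaN coordinates (0.0 / 0.0); callers may want to guard against that.
        return acc -> new Location(acc[0] / acc[2], acc[1] / acc[2]);
    }

    @Override
    public Set<Characteristics> characteristics() {
        return Collections.emptySet();
    }
}
```

Because `furtherThan` is true for the distant locations, a filter that should keep the nearby ones would use its negation, for example `centre.furtherThan(NOISE_DISTANCE).negate()`.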

A slightly more elegant method would be to calculate the standard deviation of the distances to the centre and then discard any locations that are more than a certain number of standard deviations from the average distance. This would have the advantage of taking account of sets of locations in which all or most of the samples are more than the NOISE_DISTANCE from the centre. In that case the CentreCollector will have to return a more complex object that holds the location and statistical information and have furtherThan as a member of that class rather than of Location. Let me know in the comments if you want me to post the equivalent code for using standard deviations.
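The standard-deviation variant could be sketched roughly as below. This is not the collector-based design described above but a simplified static helper; the class name `StdDevFilter`, the `{lat, lon}` array representation, the equirectangular distance approximation, the use of the population standard deviation, and the one-sided cut-off (only points farther than the mean are discarded) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; points are {latitude, longitude} in degrees.
final class StdDevFilter {
    static final double EARTH_RADIUS_M = 6_371_000.0;

    /** Approximate distance in meters (equirectangular projection, fine for short distances). */
    static double distanceMeters(double[] a, double[] b) {
        double meanLat = Math.toRadians((a[0] + b[0]) / 2.0);
        double dLat = Math.toRadians(b[0] - a[0]);
        double dLon = Math.toRadians(b[1] - a[1]) * Math.cos(meanLat);
        return EARTH_RADIUS_M * Math.sqrt(dLat * dLat + dLon * dLon);
    }

    /** Keeps points whose distance to the centre is at most mean + maxStdDevs * stdDev. */
    static List<double[]> keepWithinStdDevs(List<double[]> points, double maxStdDevs) {
        List<double[]> kept = new ArrayList<>();
        if (points.isEmpty()) {
            return kept;
        }
        // Centre of all points, including potential outliers.
        double lat = 0, lon = 0;
        for (double[] p : points) {
            lat += p[0];
            lon += p[1];
        }
        double[] centre = { lat / points.size(), lon / points.size() };

        // Mean of the distances to the centre.
        double[] dist = new double[points.size()];
        double mean = 0;
        for (int i = 0; i < dist.length; i++) {
            dist[i] = distanceMeters(points.get(i), centre);
            mean += dist[i];
        }
        mean /= dist.length;

        // Population standard deviation of the distances.
        double variance = 0;
        for (double d : dist) {
            variance += (d - mean) * (d - mean);
        }
        double stdDev = Math.sqrt(variance / dist.length);

        for (int i = 0; i < dist.length; i++) {
            if (dist[i] <= mean + maxStdDevs * stdDev) {
                kept.add(points.get(i));
            }
        }
        return kept;
    }
}
```

With so few samples the cut-off has to be small: among N values, none can lie more than sqrt(N - 1) population standard deviations from the mean (Samuelson's inequality), so with at most 10 points a cut-off of 3 would never discard anything. Something around 1.5 is a more plausible starting point.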
