
I'm looking for an efficient way of selecting a relatively large portion of points (2D Euclidean graph) that are the furthest away from the center. This resembles the convex hull, but would include (many) more points. Further criteria:

  • The number of points in the selection / set ("K") must be within a specified range. Most likely the range won't be very narrow, but it must work for different ranges (e.g. 0.01*N < K < 0.05*N as well as 0.1*N < K < 0.2*N).

  • The algorithm must be able to balance distance from the center and "local density". If there are dense areas near the upper part of the graph range, but sparse areas near the lower part, then the algorithm must make sure to select some points from the lower part even if they are closer to the center than the points in the upper region. (See example below)

  • Bonus: rather than simple distance from center, taking into account distance to a specific point (or both a point and the center) would be perfect.

My attempts so far have focused on using "pigeon holing" (divide graph into CxR boxes, assign points to boxes based on coordinates) and selecting "outer" boxes until we have sufficient points in the set. However, I haven't been successful at balancing the selection (dense regions over-selected because of fixed box size) nor at using a selected point as reference instead of (only) the center.

I've (poorly) drawn an example: the red dots are the points, and the green shape is an example of what I want (outside the green = selected). For sparse regions, the bounding shape comes closer to the center to find suitable points (but doesn't necessarily find any, if they're too close to the center). The yellow box is an example of what my pigeonholing-based algorithm does. Even when trying to adjust for sparser regions, it doesn't manage well.

Any and all ideas are welcome!

Gaminic

1 Answer


I don't think there are any standard algorithms that will give you what you want. You're going to have to get creative. Assuming your points are embedded in 2D Euclidean space here are some ideas:

  1. Iteratively compute several convex hulls. For example, compute the convex hull, keep the points that are part of it, then compute another convex hull ignoring the points from the first one. Continue until you have a sufficient number of points, essentially plucking points off the perimeter on each iteration. The only problem with this approach is that it will not work well for concavities in your data set (e.g., the one at the bottom of the sample you posted).

  2. Fit a Gaussian to your data and keep every point more than N standard deviations away from the mean (where N is a threshold you'd have to choose, not your point count). This should work pretty well if your data is Gaussian. If it isn't, you could always model it with several Gaussians (instead of one), and keep points whose probability density under the mixture is below some threshold. Using multiple Gaussians will probably handle concavities decently.
    References:
    http://en.wikipedia.org/wiki/Gaussian_function
    How to fit a gaussian to data in matlab/octave?

  3. Use kernel density estimation. If you create a kernel density surface, you can slice it at some height (turning it into a plateau), and the outline of the plateau gives you a perimeter shape around the points. The trick is choosing the slice height: slice too low and no points fall outside the shape, but with the right height you could get the green shape you drew. The big drawback of this approach is that it is very computationally expensive. More information: http://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation

  4. Use alpha shapes to get a general shape that wraps tightly around the outside perimeter of the point set. Then erode the shape a little to force some points outside of it. I don't have a lot of experience with alpha shapes, but this approach will also be quite computationally expensive. More info: http://doc.cgal.org/latest/Alpha_shapes_2/index.html
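
To make ideas 1 to 3 a bit more concrete, here is a rough, stdlib-only Python sketch. It is not a tuned implementation, and all names in it (`convex_hull`, `peel_hulls`, `mahalanobis_outer`, `kde_lowest_density`, `k_min`, `k_max`, `bandwidth`) are made up for illustration. It assumes the points are distinct `(x, y)` tuples and that they don't all lie on one line. For ideas 2 and 3 it ranks the points and keeps the `k` most extreme ones instead of picking a threshold by hand, which is my own shortcut for keeping K inside your range.

```python
import math

def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain. Returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower = half(pts)
    upper = half(reversed(pts))
    return lower[:-1] + upper[:-1]

def peel_hulls(points, k_min, k_max):
    """Idea 1: strip hull layers until at least k_min points are selected.

    If the last layer would overshoot k_max, only part of it is taken
    (an arbitrary part; a real implementation might rank that layer).
    """
    remaining = set(points)
    selected = []
    while remaining and len(selected) < k_min:
        layer = convex_hull(list(remaining))
        selected.extend(layer[: k_max - len(selected)])
        remaining.difference_update(layer)
    return selected

def mahalanobis_outer(points, k):
    """Idea 2: fit a single Gaussian, keep the k points furthest from the mean
    in Mahalanobis distance (i.e. in "standard deviations")."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy * sxy  # assumed non-zero (points not collinear)

    def dist2(p):
        dx, dy = p[0] - mx, p[1] - my
        return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

    return sorted(points, key=dist2, reverse=True)[:k]

def kde_lowest_density(points, k, bandwidth):
    """Idea 3: score each point with an (unnormalised) Gaussian kernel density
    estimate and keep the k lowest-density points. O(n^2), so slow for big n."""
    h2 = 2.0 * bandwidth * bandwidth

    def density(p):
        return sum(
            math.exp(-((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) / h2)
            for q in points
        )

    return sorted(points, key=density)[:k]

# Tiny demo: four corners of a square plus its centre.
demo = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
print(peel_hulls(demo, 4, 4))
print(mahalanobis_outer(demo, 4))
print(kde_lowest_density(demo, 4, bandwidth=1.0))
```

Ranking by density rather than by distance is what lets the KDE variant reach inward in sparse regions, at the cost of also picking up isolated points near the center. The `bandwidth` value plays the role of the slice height's companion knob and would need tuning for real data.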

mattnedrich