8

I am trying to find a fast algorithm for finding the (approximate, if need be) nearest neighbours of a given point in a two-dimensional space where points are frequently removed from the dataset and new points are added.

(Relatedly, there are two variants of this problem that interest me: one in which points can be thought of as being added and removed randomly and another in which all the points are in constant motion.)

Some thoughts:

  • kd-trees offer good performance, but are only suitable for static point sets
  • R*-trees seem to offer good performance for a variety of dimensions, but the generality of their design (arbitrary dimensions, general content geometries) suggests the possibility that a more specific algorithm might offer performance advantages
  • Algorithms with existing implementations are preferable (though this is not necessary)

What's a good choice here?

gsamaras
Richard
  • Possible duplicate of https://stackoverflow.com/questions/45887680/efficient-knn-implementation-which-allows-inserts/45903853#45903853 – TilmannZ Sep 18 '17 at 08:18

2 Answers

5

I agree with (almost) everything that @gsamaras said; I just want to add a few things:

  • In my experience (using large datasets with >= 500,000 points), the kNN performance of KD-Trees is worse than that of pretty much any other spatial index by a factor of 10 to 100. I tested them (2 KD-Trees and various other indexes) on a large OpenStreetMap dataset. In the following diagram, the KD-Trees are called KDL and KDS, and the 2D dataset is called OSM-P (left diagram): [image] The diagram is taken from this document; see the bullet points below for more information.
  • This research describes an indexing method for moving objects, in case you keep (re-)inserting the same points in slightly different positions.
  • Quadtrees are not too bad either; they can be very fast in 2D, with excellent kNN performance for datasets < 1,000,000 entries.
  • If you are looking for Java implementations, have a look at my index library. It has implementations of quadtrees, R-star-tree, ph-tree, and others, all with a common API that also supports kNN. The library was written for TinSpin, which is a framework for testing multidimensional indexes. Some results can be found here (the document doesn't really describe the test data, but the 'OSM-P' results are based on OpenStreetMap data with up to 50,000,000 2D points).
  • Depending on your scenario, you may also want to consider PH-Trees. They appear to be slower for kNN queries than R-Trees in low dimensionality (though still faster than KD-Trees), but they are faster for removal and updates than R-Trees. If you have a lot of removals/insertions, this may be a better choice (see the TinSpin results, Figures 2 and 46). A (my) C++ version is available here.
TilmannZ
2

Check the Bkd-Tree, which is:

an I/O-efficient dynamic data structure based on the kd-tree. [..] the Bkd-tree maintains its high space utilization and excellent query and update performance regardless of the number of updates performed on it.

However, this data structure is multidimensional and not specialized for low dimensions (like the kd-tree).

Play with it in bkdtree.


Dynamic Quadtrees can also be a candidate, with O(log n) query time and O(Q(n)) insertion/deletion time, where Q(n) is the time to perform a query in the data structure used. Note that this data structure is specialized for 2D. For 3D, however, we have octrees, and the structure can be generalized to higher dimensions in a similar way.

An implementation is QuadTree.
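
To make the idea concrete, here is a minimal sketch of a point quadtree that supports insertion, removal, and nearest-neighbour search. It is not the linked implementation: the class name `QuadTree`, the `capacity` parameter, and the method names are all hypothetical, and it assumes every point is an `(x, y)` tuple lying inside the root square. It never merges emptied cells, so it makes no claim to the bounds quoted above.

```python
import math


class QuadTree:
    """Point quadtree over a square region; a leaf splits when it overflows."""

    def __init__(self, cx, cy, half, capacity=4):
        self.cx, self.cy, self.half = cx, cy, half  # centre and half-width
        self.capacity = capacity
        self.points = []      # points held here while this node is a leaf
        self.children = None  # four sub-quadrants once the node has split

    def _child_for(self, p):
        # Quadrant index: +1 for the east half, +2 for the north half.
        return (p[0] >= self.cx) + 2 * (p[1] >= self.cy)

    def insert(self, p):
        if self.children is not None:
            self.children[self._child_for(p)].insert(p)
            return
        self.points.append(p)
        # Split an overfull leaf, unless it is already tiny (duplicate points).
        if len(self.points) > self.capacity and self.half > 1e-9:
            h = self.half / 2
            self.children = [
                QuadTree(self.cx + dx * h, self.cy + dy * h, h, self.capacity)
                for dy in (-1, 1) for dx in (-1, 1)
            ]
            pts, self.points = self.points, []
            for q in pts:
                self.children[self._child_for(q)].insert(q)

    def remove(self, p):
        """Remove one copy of p; return True if it was present."""
        if self.children is not None:
            return self.children[self._child_for(p)].remove(p)
        if p in self.points:
            self.points.remove(p)
            return True
        return False

    def _min_dist(self, q):
        # Distance from q to this node's square (0 if q lies inside it).
        dx = max(abs(q[0] - self.cx) - self.half, 0.0)
        dy = max(abs(q[1] - self.cy) - self.half, 0.0)
        return math.hypot(dx, dy)

    def nearest(self, q, best=None):
        """Return (distance, point) for the stored point closest to q, or None."""
        if best is not None and self._min_dist(q) >= best[0]:
            return best  # this square cannot contain anything closer
        if self.children is None:
            for p in self.points:
                d = math.dist(p, q)
                if best is None or d < best[0]:
                    best = (d, p)
            return best
        # Visit the closest quadrants first so that pruning kicks in early.
        for child in sorted(self.children, key=lambda c: c._min_dist(q)):
            best = child.nearest(q, best)
        return best


# Usage: index the unit square, then update and query it.
tree = QuadTree(0.5, 0.5, 0.5)
for point in [(0.1, 0.1), (0.9, 0.9), (0.4, 0.6)]:
    tree.insert(point)
tree.remove((0.4, 0.6))
closest = tree.nearest((0.5, 0.5))
```

Visiting the quadrants in order of their distance to the query is what lets the search skip most of the tree; a production version would also merge emptied quadrants back into their parent after removals.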


R*-tree is another choice, but I agree with you on the generality. An r-star-tree implementation exists too.


A Cover tree could be considered as well, but I am not sure if it fits your description. Read more here, and check the implementation on CoverTree.


The kd-tree should still be considered, since its performance is remarkable in 2 dimensions and its insertion complexity is logarithmic in size.

nanoflann and CGAL are just two implementations of it; the first requires no installation and the second does, but may be more performant.
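
For illustration, here is a minimal sketch of a 2D kd-tree with incremental insertion. It uses neither the nanoflann nor the CGAL API; `KDNode`, `KDTree2D`, and their methods are hypothetical names. Two simplifications are assumed: the tree is never rebalanced, so insertion is only logarithmic on average when points arrive in roughly random order, and removal merely marks a node as deleted instead of restructuring the tree.

```python
import math


class KDNode:
    __slots__ = ("point", "axis", "left", "right", "deleted")

    def __init__(self, point, axis):
        self.point, self.axis = point, axis  # axis 0 splits on x, axis 1 on y
        self.left = self.right = None
        self.deleted = False                 # tombstone flag for lazy removal


class KDTree2D:
    def __init__(self):
        self.root = None

    def insert(self, point):
        """Walk down to a free slot; the split axis alternates per level."""
        if self.root is None:
            self.root = KDNode(point, 0)
            return
        node = self.root
        while True:
            side = "left" if point[node.axis] < node.point[node.axis] else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, KDNode(point, 1 - node.axis))
                return
            node = child

    def remove(self, point):
        """Lazy deletion: mark the node, keep it as a splitting plane."""
        node = self.root
        while node is not None:
            if node.point == point and not node.deleted:
                node.deleted = True
                return True
            if point[node.axis] < node.point[node.axis]:
                node = node.left
            else:
                node = node.right
        return False

    def nearest(self, q):
        """Return (distance, point); (inf, None) if no live point exists."""
        best = [math.inf, None]

        def visit(node):
            if node is None:
                return
            if not node.deleted:
                d = math.dist(node.point, q)
                if d < best[0]:
                    best[0], best[1] = d, node.point
            diff = q[node.axis] - node.point[node.axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            visit(near)
            # Cross the splitting line only if a closer point could lie beyond it.
            if abs(diff) <= best[0]:
                visit(far)

        visit(self.root)
        return best[0], best[1]
```

With many removals the tombstones accumulate and slow queries down, so a real implementation would rebuild the tree periodically; the recursive search can also hit Python's recursion limit if points are inserted in sorted order.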


In any case, I would try more than one approach and benchmark (since all of them have implementations and these data structures are usually affected by the nature of your data).
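
A benchmark along these lines could start from a small harness like the sketch below. Everything in it is an assumption made for illustration: the `insert`/`remove`/`nearest` interface, the uniform random workload, and the names `BruteForceIndex`, `make_workload`, and `run_workload`. The brute-force index is a correctness reference and a baseline, not a recommendation.

```python
import math
import random
import time


class BruteForceIndex:
    """Reference index: O(n) queries, but trivially correct under updates."""

    def __init__(self):
        self.points = set()

    def insert(self, p):
        self.points.add(p)

    def remove(self, p):
        self.points.discard(p)

    def nearest(self, q):
        # Returns None when the index is empty.
        return min(self.points, key=lambda p: math.dist(p, q), default=None)


def make_workload(n_initial, n_ops, seed=0):
    """Build a reproducible list of (operation, point) pairs in the unit square."""
    rng = random.Random(seed)
    live = [(rng.random(), rng.random()) for _ in range(n_initial)]
    ops = [("insert", p) for p in live]
    for _ in range(n_ops):
        kind = rng.choice(("insert", "remove", "nearest"))
        if kind == "insert":
            p = (rng.random(), rng.random())
            live.append(p)
            ops.append(("insert", p))
        elif kind == "remove" and live:
            # Remove a point that is currently in the index.
            ops.append(("remove", live.pop(rng.randrange(len(live)))))
        else:
            ops.append(("nearest", (rng.random(), rng.random())))
    return ops


def run_workload(index, ops):
    """Replay ops against index; return (elapsed seconds, nearest-query answers)."""
    answers = []
    start = time.perf_counter()
    for kind, arg in ops:
        result = getattr(index, kind)(arg)
        if kind == "nearest":
            answers.append(result)
    return time.perf_counter() - start, answers


ops = make_workload(n_initial=1000, n_ops=2000, seed=42)
elapsed, answers = run_workload(BruteForceIndex(), ops)
```

Replaying the same `ops` list against each candidate index gives comparable timings, and comparing the returned answers against the brute-force ones catches incorrect results. To be meaningful, the workload should be replaced with one that resembles your real data and update pattern.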

gsamaras
  • My issue with these algorithms is that they use points. What if I am storing objects with actual dimensions, which may therefore overlap the quadtree cells or the splitting line of a KD tree? None of the examples explain that complexity. – WDUK Feb 18 '22 at 08:22
  • @WDUK: the answer probably doesn't discuss that because it was not part of the question I asked. – Richard Sep 07 '22 at 14:29
  • Sure, but these algorithms are often needed for physics/particle simulation optimisations, and no one ever discusses situations beyond infinitely small points, so they often feel unexplored beyond basic data examples. So it's never clear whether these algorithms are still useful beyond basic points. @Richard – WDUK Sep 13 '22 at 04:59