Input:
• N points {P1, …. , Pn} - each point is from the same dimension t:
- Pi = {x_1, …., x_t} where k is between 18-30 .
• Distance function – dist(Pi, Pj) - returns a number that is the distance between the points. (The function is a custom function – not a standard Minkowski distance).
The Problem:
• Main problem:
- Find the K closest pairs from all the N points – as fast as possible.
• Secondary problem:
- Given a point Q = {x_1, …, x_t} return the K closest pairs.
• Nice to Have:
- A Database where we can add/remove a point Pi and the above queries will run as fast as possible.
Relevant Data structures:
• KD-TREE
• R-TREE
• BALL-TREE
Possible Solutions:
• Main problem:
Build a BallTree (sklearn.neighbors.BallTree).
For every point P in the BallTree find K closest pairs (Now we have N List- where every list contains the K closest pairs for every Point Pi).
Take the best K pairs from all the list above.
• Secondary problem:
Build a BallTree (sklearn.neighbors.BallTree).
Query for closest k pairs for a given point Q.
Time Complexity So far:
For every point in the tree (N in total) find K closest pairs which take O(K*log(N)) - So in total O(N * K * log(N)).
Take the best K pairs from N sorted list - can take O( Max{ K * log(K), N } ). for example with maintaining a min HEAP of size K.
The total complexity, for now, is O(N * K * log(N)) - can we do better?