5

Scipy (http://www.scipy.org/) offers two KD Tree classes; the KDTree and the cKDTree.

The cKDTree is much faster, but is less customizable and query-able than the KDTree (as far as I can tell from the docs).

Here is my problem: I have a list of 3million 2 dimensional (X,Y) points. I need to return all of the points within a distance of X units from every point.

With the KDtree, there is an option to do just this: KDtree.query_ball_tree() It generates a list of lists of all the points within X units from every other point. HOWEVER: this list is enormous and quickly fills up my virtual memory (about 744 million items long).

Potential solution #1: Is there a way to parse this list into a text file as it is writing?

Potential solution #2: I have tried using a for loop (for every point in the list) and then finding that single point's neighbors within X units by employing: KDtree.query_ball_point(). HOWEVER: this takes forever as it needs to run the query millions of times. Is there a cKDTree equivalent of this KDTree tool?

Potential solution #3: Beats me, anyone else have any ideas?

nickhar
  • 19,981
  • 12
  • 60
  • 73
Dlinet
  • 1,193
  • 3
  • 15
  • 22

2 Answers2

4

From scipy 0.12 on, both KD Tree classes have feature parity. Quoting its announcement:

cKDTree feature-complete

Cython version of KDTree, cKDTree, is now feature-complete. Most operations (construction, query, query_ball_point, query_pairs, count_neighbors and sparse_distance_matrix) are between 200 and 1000 times faster in cKDTree than in KDTree. With very minor caveats, cKDTree has exactly the same interface as KDTree, and can be used as a drop-in replacement.

jorgeca
  • 5,482
  • 3
  • 24
  • 36
  • Ah that would be excellent. I don't have any skill/experience with compiling from source so maybe I will look into that. Otherwise, unless another solution is posted, I will wait for the new release of scipy. – Dlinet Oct 26 '12 at 15:39
  • @Dlinet Version 0.12 was released last month. – jorgeca May 11 '13 at 00:05
  • 1
    For anyone arriving at this answer in present times (2022), starting from SciPy v1.6.0 `cKDTree` is *"functionally identical to KDTree"* with matching performance (see [Note](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html)). – Gabriel Mar 07 '22 at 12:55
1

Try using KDTree.query_ball_point instead. It takes a single point, or array of points, and produces the points within a given distance of the input point(s).

You can use this function to perform batch queries. Give it, say, 100000 points at a time, and then write the results out to a file. Something like this:

BATCH_SIZE = 100000
for i in xrange(0, len(pts), BATCH_SIZE):
    neighbours = tree.query_ball_point(pts[i:i+BATCH_SIZE], X)
    # write neighbours to a file...
nneonneo
  • 171,345
  • 36
  • 312
  • 383
  • Unless I am understanding you wrong, I think that is exactly what I have listed as potential solution #2 no? The problem with that method as far as I can tell is it takes forever. – Dlinet Oct 26 '12 at 15:34
  • What you suggested was to loop over every single point. Here, what I suggested is to use it in a "batch" mode, so you spend less time iterating. – nneonneo Oct 26 '12 at 16:24
  • Ah interesting, I will look into this. I have never used "batches" before. do you suggest any particular resources for learning more about batches? – Dlinet Oct 26 '12 at 20:37