0

In an python application I'm developing I have an array of 3D points (of size between 2 and 100000) and I have to find the points that are within a certain distance from each other (say between two values, like 0.1 and 0.2). I need this for a graphic application and this search should be very fast (~1/10 of a second for a sample of 10000 points)

As a first experiment I tried to use the scipy.spatial.KDTree.query_pairs implementation, and with a sample of 5000 point it takes 5 second to return the indices. Do you know any approach that may work for this specific case?

A bit more about the application:

The points represents atom coordinates and the distance search is useful to determine the bonds between atoms. Bonds are not necessarily fixed but may change at each step, such as in the case of hydrogen bonds.

pygabriel
  • 9,840
  • 4
  • 41
  • 54
  • 1
    Did you see https://github.com/scipy/scipy/pull/262 ? Looks like cKDTree is much faster, and the query pairs function should be available in the development version of SciPy. – rici Mar 20 '13 at 04:23
  • I didn't see that, thank you for pointing this out! it seems that the new version of cKDTree would be fast enough for my purposes, I may try to backport that in my app... – pygabriel Mar 20 '13 at 05:52

2 Answers2

5

Great question! Here is my suggestion:

Divide each coordinate by your "epsilon" value of 0.1/0.2/whatever and round the result to an integer. This creates a "quotient space" of points where distance no longer needs to be determined using the distance formula, but simply by comparing the integer coordinates of each point. If all coordinates are the same, then the original points were within approximately the square root of three times epsilon from each other (for example). This process is O(n) and should take 0.001 seconds or less.

(Note: you would want to augment the original point with the three additional integers that result from this division and rounding, so that you don't lose the exact coordinates.)

Sort the points in numeric order using dictionary-style rules and considering the three integers in the coordinates as letters in words. This process is O(n * log(n)) and should take certainly less than your 1/10th of a second requirement.

Now you simply proceed through this sorted list and compare each point's integer coordinates with the previous and following points. If all coordinates match, then both of the matching points can be moved into your "keep" list of points, and all the others can be marked as "throw away." This is an O(n) process which should take very little time.

The result will be a subset of all the original points, which contains only those points that could be possibly involved in any bond, with a bond being defined as approximately epsilon or less apart from some other point in your original set.

This process is not mathematically exact, but I think it is definitely fast and suited for your purpose.

Joseph Myers
  • 6,434
  • 27
  • 36
  • +1. Great solution. I did not see it then type my post. I think that my approach is the same but you formulate your answer much better. – Roman Fursenko Mar 20 '13 at 03:53
  • Thank you! This approach seems pretty clever! I'm trying to implement this to see if it works OK in my case. – pygabriel Mar 20 '13 at 05:50
  • I've also found a similar approach to the one you mentioned thatn may work as well (after you sort the points, you can check for bonds in the neighbour "cells") http://en.wikipedia.org/wiki/Cell_lists – pygabriel Mar 25 '13 at 17:53
  • Thank you, that was actually the problem I was worried about, a small percentage of matches being left out because of rounding discontinuities. If you check for bonds in neighbor cells, that would definitely solve that problem. I had no idea this was a well-known existing research problem... when I read your problem it just seemed so tantalizing: like an O(n) solution was within reach even though on the surface it appears to be O(n^2). My idea was based on topology that I studied in math and a conversation with a physicist who happened to be nearby and liked the problem as well. – Joseph Myers Mar 25 '13 at 21:36
1

The first thing that comes to my mind is: If we calculate the distance between each two atoms in the set it will be O(N^2) operations. It is very slow. What about to introduce the statical orthogonal grid with some cells size (for example close to the distance you are interested) and then determine the atoms belonging to the each cell of the grid (it takes O(N) operations) After this procedure you can reduce the time for searching of the neighbors.

Roman Fursenko
  • 688
  • 4
  • 9