
Problem

I'm working with OpenStreetMap data and want to find, for each point feature, the polygon it lies in. In total there are tens of thousands of polygons and around 100,000,000 points; I can hold all of this data in memory. The polygons usually have thousands of vertices, which makes point-in-polygon tests very expensive.

Idea

I could index all polygons with an R-tree, so that I only need to test the polygons whose bounding box contains the point.
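
As a baseline, the prefilter idea can be sketched without any library: a flat list of bounding boxes stands in for the R-tree (a real R-tree would prune hierarchically instead of scanning every box), and a ray-casting test does the exact check. All names here are illustrative:

```python
def bounding_box(polygon):
    """Axis-aligned bounding box of a polygon given as [(x, y), ...]."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

def point_in_box(pt, box):
    xmin, ymin, xmax, ymax = box
    return xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax

def point_in_polygon(pt, polygon):
    """Ray casting: count how often a ray to the right crosses an edge."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x where the edge crosses the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def locate(pt, polygons, boxes):
    """Indices of polygons containing pt; the box test filters the slow test."""
    return [i for i, poly in enumerate(polygons)
            if point_in_box(pt, boxes[i]) and point_in_polygon(pt, poly)]
```

With an R-tree, the linear scan in `locate` would be replaced by a tree query that returns only the boxes containing the point.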

Probable new problem

Since the polygons touch each other (think of administrative boundaries), many points lie in the bounding boxes of several polygons, which still forces many point-in-polygon tests.

Question

Do you have any better suggestion than using an R-Tree?

user2033412

2 Answers


Quad-trees will likely work worse than rasterization; they are essentially a repeated rasterization into 2x2 images. But definitely exploit rasterization for all the easy cases, because testing the raster is about as fast as it gets. If you can resolve 90% of your points cheaply, you have more time for the remainder.
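
A minimal sketch of such a raster, assuming the polygons have disjoint interiors (as administrative areas do): a cell that no polygon edge crosses and whose center lies inside polygon i must lie entirely inside polygon i, so every point in that cell is resolved without an exact test; cells crossed by an edge fall back to the slow path. The function names and the choice of Liang-Barsky clipping for the edge-vs-cell test are my own:

```python
def point_in_polygon(pt, polygon):
    """Ray casting: count how often a ray to the right crosses an edge."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def seg_intersects_rect(p, q, rect):
    """Liang-Barsky clipping: does segment p-q touch the axis-aligned rect?"""
    xmin, ymin, xmax, ymax = rect
    t0, t1 = 0.0, 1.0
    dx, dy = q[0] - p[0], q[1] - p[1]
    for denom, num in ((-dx, p[0] - xmin), (dx, xmax - p[0]),
                       (-dy, p[1] - ymin), (dy, ymax - p[1])):
        if denom == 0.0:
            if num < 0.0:
                return False        # parallel to and outside this boundary
        else:
            t = num / denom
            if denom < 0.0:
                t0 = max(t0, t)     # entering the rect
            else:
                t1 = min(t1, t)     # leaving the rect
            if t0 > t1:
                return False
    return True

def build_raster(polygons, cell, origin, nx, ny):
    """raster[ix][iy] = polygon index if the cell lies entirely inside that
    polygon, else None (meaning: fall back to exact tests)."""
    x0, y0 = origin
    raster = [[None] * ny for _ in range(nx)]
    edges = [(poly[k], poly[(k + 1) % len(poly)])
             for poly in polygons for k in range(len(poly))]
    for ix in range(nx):
        for iy in range(ny):
            rect = (x0 + ix * cell, y0 + iy * cell,
                    x0 + (ix + 1) * cell, y0 + (iy + 1) * cell)
            # a cell crossed by any polygon edge stays unresolved
            if any(seg_intersects_rect(a, b, rect) for a, b in edges):
                continue
            center = (rect[0] + cell / 2, rect[1] + cell / 2)
            for i, poly in enumerate(polygons):
                if point_in_polygon(center, poly):
                    raster[ix][iy] = i   # no edge enters the cell, so the
                    break                # whole cell is inside polygon i
    return raster

def classify(pt, raster, polygons, cell, origin):
    """Raster lookup first; exact tests only for unresolved cells."""
    ix = int((pt[0] - origin[0]) // cell)
    iy = int((pt[1] - origin[1]) // cell)
    label = raster[ix][iy]
    if label is not None:
        return label                     # resolved by the raster alone
    for i, poly in enumerate(polygons):  # slow path
        if point_in_polygon(pt, poly):
            return i
    return None
```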

Also make sure to remove duplicates first. Indexes often suffer from duplicates, and they are obviously redundant to test.

R*-trees are probably a good thing to try, but you need to implement them really carefully.

The operation you are looking for is a containment spatial join. I don't know of any existing implementation you could use - but given your performance requirements, I would carefully implement it myself anyway. Also make sure to tune the parameters and profile your code!

The basic idea of the join is to build two trees - one for the points, one for the polygons. You then start with the pair of root nodes and repeat the following recursively down to the leaf level:

  • If the two nodes' bounding boxes do not overlap: return.
  • If at least one is a directory node:
    • Decide by a heuristic (you'll need to figure this part out; "larger extent" may do for a start) which directory node to expand.
    • Recurse into each child of the expanded node, paired with the other, unexpanded node.
  • If both are leaf nodes:
    • fast test: point vs. bounding box of the polygon
    • slow test: point in polygon
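
The recursion above might be sketched like this, with plain dicts standing in for the tree nodes; `emit` receives only the pairs that survive the fast bounding-box test, and the slow point-in-polygon test would live inside it (all names are illustrative):

```python
def boxes_overlap(a, b):
    """a, b = (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def point_in_box(pt, box):
    return box[0] <= pt[0] <= box[2] and box[1] <= pt[1] <= box[3]

def extent(box):
    return max(box[2] - box[0], box[3] - box[1])

def tree_join(pnode, gnode, emit):
    """pnode indexes points, gnode indexes polygons.
    Directory nodes: {'box': ..., 'children': [...]}.
    Point leaves: {'box': ..., 'points': [...]}.
    Polygon leaves: {'box': ..., 'polys': [(poly_box, poly_id), ...]}."""
    if not boxes_overlap(pnode['box'], gnode['box']):
        return                              # prune: no pair below can match
    p_dir = 'children' in pnode
    g_dir = 'children' in gnode
    if not p_dir and not g_dir:             # leaf/leaf: do the fast test
        for pt in pnode['points']:
            for poly_box, poly_id in gnode['polys']:
                if point_in_box(pt, poly_box):
                    emit(pt, poly_id)       # slow test happens inside emit
        return
    # heuristic: expand the directory node with the larger extent
    if p_dir and (not g_dir or extent(pnode['box']) >= extent(gnode['box'])):
        for child in pnode['children']:
            tree_join(child, gnode, emit)
    else:
        for child in gnode['children']:
            tree_join(pnode, child, emit)
```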

You can accelerate this further if you have a fast interior test for the polygon, in particular a rectangle-in-polygon test. An approximate test may be good enough, as long as it is fast.

For more detailed information, search for r-tree spatial join.

Has QUIT--Anony-Mousse

Try using quad trees.

Basically you can recursively partition space into 4 quadrants, and for each part you should know: a) which polygons are a superset of that part, b) which polygons intersect that part.

This adds an O(log n) overhead factor, which you might not be happy with.

The other option is to simply partition space with a grid. You keep the same information for each grid cell as in the case above. This has only constant overhead.
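
A grid of per-cell candidate lists might look like this sketch (for simplicity it stores bounding-box overlaps per cell rather than the exact superset/intersect sets described above; the exact point-in-polygon test then runs only on a cell's candidates):

```python
from math import floor

def build_grid(polygons, cell, origin):
    """Map each grid cell to the polygons whose bounding box overlaps it.
    polygons: list of [(x, y), ...]; cell: cell edge length;
    origin: (xmin, ymin) of the grid."""
    x0, y0 = origin
    grid = {}
    for idx, poly in enumerate(polygons):
        xs = [p[0] for p in poly]
        ys = [p[1] for p in poly]
        ix0 = floor((min(xs) - x0) / cell)
        ix1 = floor((max(xs) - x0) / cell)
        iy0 = floor((min(ys) - y0) / cell)
        iy1 = floor((max(ys) - y0) / cell)
        for ix in range(ix0, ix1 + 1):
            for iy in range(iy0, iy1 + 1):
                grid.setdefault((ix, iy), []).append(idx)
    return grid

def candidates(pt, grid, cell, origin):
    """Polygon indices worth running the exact point-in-polygon test on."""
    ix = floor((pt[0] - origin[0]) / cell)
    iy = floor((pt[1] - origin[1]) / cell)
    return grid.get((ix, iy), [])
```

An empty candidate list means the point is in no polygon at all, with no exact test needed.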

Both of these options assume that the distribution of polygons is reasonably uniform.

There is another option if you can process the points offline (in other words, you can choose the order in which the points are processed). Then you can use sweep-line techniques: sort the points by one coordinate, iterate over them in that order, and maintain only the currently interesting set of polygons during the iteration.
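
A sketch of that sweep, using axis-aligned bounding boxes as the "interesting set" filter (a real implementation would retire expired boxes with a heap instead of rescanning the active set, and would run the exact test on the returned candidates):

```python
def sweep_candidates(points, poly_boxes):
    """Process points in x order; for each point, report the indices of the
    polygons whose bounding box contains it. poly_boxes is a list of
    (xmin, ymin, xmax, ymax)."""
    starts = sorted(range(len(poly_boxes)), key=lambda i: poly_boxes[i][0])
    out = []
    active = set()
    si = 0
    for pt in sorted(points):
        x, y = pt
        # activate polygons whose x-interval has started
        while si < len(starts) and poly_boxes[starts[si]][0] <= x:
            active.add(starts[si])
            si += 1
        # retire polygons whose x-interval has ended
        active = {i for i in active if poly_boxes[i][2] >= x}
        out.append((pt, sorted(i for i in active
                               if poly_boxes[i][1] <= y <= poly_boxes[i][3])))
    return out
```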

usamec
  • I'm actually using tile-like indexing at the moment and rasterize the polygons with the edge-flag algorithm. The problem is that the preprocessing is slow. But I doubt that quad-trees would be any faster... :-( – user2033412 Sep 03 '14 at 13:12