0

I'm looking for a solution of following problem: I have table of thousands of point features in postgress 9.3. I would like to get rid of insignificant points. By insignificant I understand points with not many neighbors, what could be defined either by distance or by extended bbox. Preferably I'd like to count this neighbors and then decide what number of them is significant. Any advice welcome!

user30455
  • 9
  • 3
  • Do you mean you want to count how many neighbours each point has within a certain radius? – John Powell Aug 31 '14 at 15:09
  • You are perfectly right! With such a simplification that it can be either radius or just a simple envelope with diagonal of a certain length (estimate is enough). – user30455 Aug 31 '14 at 18:49
  • If this has answered the question, it is considered polite to accept the answer, if not, could you ask for further clarification? – John Powell Sep 11 '14 at 21:02

2 Answers2

3

The easiest way to do this is using a spatial self-join to count the number of points from one table that intersect the buffer of each point in another table (where both tables are the same, ie, you are looking at a Cartesian product, although this is mitigated by having a spatial index -- see below). Assuming you have a table called points and your point field is called geom and each point had an ID field, and you are interested in all points within, say, 1km of each point, you would do:

select a.id, count(b.id) as num_neighbors 
from 
    points a, points b 
where st_dwithin(a.geom, b.geom, 1000) and a.id != b.id 
group by a.id 
order by num_neighbors;

Use order by and limit x to find the x points with the lowest number of neighbors.

If you wanted to use a rectangular region instead of a circular region for calculating the number of neigbors, use the ST_Expand function (in conjunction with ST_Intersects) instead of ST_DWithin. However, this will be slower, as well as less accurate, as ST_DWithin uses the index directly.

For performance reasons on non-trivial table sizes make sure you have a spatial index on your geometry field.

EDIT 2: Following your question update, I suggest you try this aproach, following the update, set, from, where order that Postgres uses for updates from a subquery.

update point SET num_neighbors=agg.num_neighbors 
   from (
      select count(b.id) as num_neighbors, a.id 
      from 
          point a, point b 
       where st_dwithin(a.geom, b.geom, 1000) and a.id != b.id group by a.id) 
   agg 
where agg.id=point.id;

Use explain in front of this query and compare the expected execution time with the query you have. I don't have your exact data, but with a test set of random points, this query was orders of magnitude faster than the one you put in your updated question.

EDIT 1: Original answer edited to take account of much better approach using ST_DWithin as suggested by Mike T.

John Powell
  • 12,253
  • 6
  • 59
  • 67
  • In order to take advantage of the spatial index, use ST_DWithin instead of st_intersects(st_buffer(... – Mike T Aug 31 '14 at 21:39
  • @MikeT. Edited, thanks. I didn't think it would make so much difference, tested on non-trivial dataset. – John Powell Sep 01 '14 at 07:09
  • Thank you so much! I will try to implement and will get back with results! – user30455 Sep 01 '14 at 08:05
  • You probably want to add order by num_neighbors to get an ordering by the points with the least neighbors within some distance. I edited the question accordingly. – John Powell Sep 01 '14 at 08:39
  • Your proposed syntax is much more elegant but on my dataset execution takes even bit longer (220s)...actually isn't it strange? anyhow thank You very much for your efforts. Sorry for making a mess in answers/questions but I was not sure if I could put code to the comment content. I will follow your guidance in the future! – user30455 Sep 01 '14 at 21:01
  • That is very strange as the output of explain was very clearly in favor of the syntax I suggested. Often, though, with spatial queries the complexity of the spatial operations is much greater than the costs of disk fetches, which might explain it. You can try updating your statistics as outlined in spatial index section of docs. If all of this has helped, please consider accepting the answer. – John Powell Sep 02 '14 at 04:51
0

@John as I need to store results in my source table I slightly modified your statement (as I'm just a beginner in SQL it can be a bit childish):

UPDATE point SET num_neighbors =(
select count(b.id) as num_neighbors from 
    point a, point b 
where st_dwithin(a.geom, b.geom, 100) and a.id != b.id and a.id=point.id
group by a.id);

It takes around 200s for ~85k records which is acceptable but not ecstatic (without gist index it takes longer than for dinosaurs to become a petrol). If you see any method to speed it up (even compromising some accuracy) that would be awesome!.

regards Thomas

user30455
  • 9
  • 3
  • Thomas, answer updated with a more efficient version of this update query. In general, it is better to edit the question, than put a further question as an answer :D. Just a friendly pointer. – John Powell Sep 01 '14 at 20:15