7

Why is this not returning a count of number of points in each neighbourhoods (bounding box)?

import geopandas as gpd

def radius(points_neighbour, points_center, new_field_name, r):
    """
    :param points_neighbour:
    :param points_center:
    :param new_field_name: new field_name attached to points_center
    :param r: radius around points_center
    :return:
    """
    sindex = points_neighbour.sindex
    pts_in_neighbour = []
    for i, pt_center in points_center.iterrows():
        nearest_index = list(sindex.intersection((pt_center.LATITUDE-r, pt_center.LONGITUDE-r, pt_center.LATITUDE+r, pt_center.LONGITUDE+r)))
        pts_in_this_neighbour = points_neighbour[nearest_index]
        pts_in_neighbour.append(len(pts_in_this_neighbour))
    points_center[new_field_name] = gpd.GeoSeries(pts_in_neighbour)

Every loop gives the same result.

Second question, how can I find k-th nearest neighbour?

More information about the problem itself:

  • We are doing it at a very small scale e.g. Washington State, US or British Columbia, Canada

  • We hope to utilize geopandas as much as possible since it's similar to pandas and supports spatial indexing: RTree

  • For example, sindex here has method nearest, intersection, etc.

Please comment if you need more information. This is the code in class GeoPandasBase

@property
def sindex(self):
    if not self._sindex_generated:
        self._generate_sindex()
    return self._sindex

I tried Richard's example but it didn't work

def radius(points_neighbour, points_center, new_field_name, r):
    """
    :param points_neighbour:
    :param points_center:
    :param new_field_name: new field_name attached to points_center
    :param r: radius around points_center
    :return:
    """
    sindex = points_neighbour.sindex
    pts_in_neighbour = []
    for i, pt_center in points_center.iterrows():
        pts_in_this_neighbour = 0
        for n in sindex.intersection(((pt_center.LATITUDE-r, pt_center.LONGITUDE-r, pt_center.LATITUDE+r, pt_center.LONGITUDE+r))):
            dist = pt_center.distance(points_neighbour['geometry'][n])
            if dist < radius:
                pts_in_this_neighbour = pts_in_this_neighbour + 1
        pts_in_neighbour.append(pts_in_this_neighbour)
    points_center[new_field_name] = gpd.GeoSeries(pts_in_neighbour)

To download the shape file, goto https://catalogue.data.gov.bc.ca/dataset/hellobc-activities-and-attractions-listing and choose ArcView to download

ZHU
  • 904
  • 1
  • 11
  • 25
  • Could you post the code you use to generate the rtree? – Richard Jun 23 '17 at 18:23
  • @Richard points_neighbour.sindex is this what you wanted? – ZHU Jun 23 '17 at 18:33
  • Yes, that should be it. – Richard Jun 23 '17 at 18:37
  • Just wondering if you have the `points_neighbour.sindex` code? – Richard Jun 24 '17 at 19:51
  • @Richard see this http://geoffboeing.com/2016/10/r-tree-spatial-index-python/ – ZHU Jun 25 '17 at 06:03
  • You need to edit your question to include the code you are using; otherwise, you are wasting people's time by having them try to decipher the link you posted. – Richard Jun 25 '17 at 07:04
  • I have included the code I'm using, I don't know why you're asking. – ZHU Jun 25 '17 at 16:32
  • Because the way you _generate_ the r-tree index may be an important part of understanding why what you're trying to do doesn't work. But also because including a [minimum-working example](http://sscce.org/) is a pre-requisite to writing good questions here. Right now you are asking potential answerers to recreate work that you've already done. Rather, you should include enough functional code to reproduce an example of the problem. – Richard Jun 25 '17 at 21:28
  • @Richard this piece of code doesn't give me the desired result that's why I'm asking for help. But I checked the logic there's nothing I can find. The way I'm generating the r-tree is by geopandas. – ZHU Jun 26 '17 at 00:20
  • You should also post the part of your code where you generate the r-tree. We cannot know how `points_neighbour.sindex` is loaded from the code you've posted thus far. It would be especially helpful if you produced a working example code that demonstrates your problem. – Richard Jun 26 '17 at 01:28
  • It's in geopandas as an optional dependency. See http://geopandas.org/install.html Too large to copy the whole library here isn't it? – ZHU Jun 26 '17 at 01:41
  • I sense sarcasm, but I am only trying to help. Presumably there's a place in your code in which your points are added to the R-tree. You should post that place. Or perhaps they are magically indexed by dint of being in a special kind of data frame. If so, that, too, would be worth mentioning. – Richard Jun 26 '17 at 01:52
  • @Richard sorry about that. I'll post it now. Let me know if you need more. Yeah it's not the same as pandas, in a lot more ways. – ZHU Jun 26 '17 at 02:00

2 Answers2

5

Rather than answer your question directly, I'd argue that you're doing this wrong. After arguing this, I'll give a better answer.

Why you're doing it wrong

An r-tree is great for bounding-box queries in two or three Euclidean dimensions.

You are looking up longitude-latitude points on a two-dimensional surface curved in a three-dimensional space. The upshot is that your coordinate system will yield singularities and discontinuities: 180°W is the same as 180°E, 2°E by 90°N is close to 2°W by 90°N. The r-tree does not capture these sorts of things!

But, even if they were a good solution, your idea to take lat±r and lon±r yields a square region; rather, you probably want a circular region around your point.

How to do it right

  1. Rather than keeping the points in lon-lat format, convert them to xyz format using a spherical coordinate conversion. Now they are in a 3D Euclidean space and there are no singularities or discontinuities.

  2. Place the points in a three-dimensional kd-tree. This allows you to quickly, in O(log n) time, ask questions like "What are the k-nearest neighbours to this point?" and "What are all the points within a radius r of this points?" SciPy comes with an implementation.

  3. For your radius search, convert from a Great Circle radius to a chord: this makes the search in 3-space equivalent to a radius search on a circle wrapped to the surface of a sphere (in this case, the Earth).

Code for doing it right

I've implemented the foregoing in Python as a demonstration. Note that all spherical points are stored in (longitude,latitude)/(x-y) format using a lon=[-180,180], lat=[-90,90] scheme. All 3D points are stored in (x,y,z) format.

#/usr/bin/env python3

import numpy as np
import scipy as sp
import scipy.spatial

Rearth = 6371

#Generate uniformly-distributed lon-lat points on a sphere
#See: http://mathworld.wolfram.com/SpherePointPicking.html
def GenerateUniformSpherical(num):
  #Generate random variates
  pts      = np.random.uniform(low=0, high=1, size=(num,2))
  #Convert to sphere space
  pts[:,0] = 2*np.pi*pts[:,0]          #0-360 degrees
  pts[:,1] = np.arccos(2*pts[:,1]-1)   #0-180 degrees
  #Convert to degrees
  pts = np.degrees(pts)
  #Shift ranges to lon-lat
  pts[:,0] -= 180
  pts[:,1] -= 90
  return pts

def ConvertToXYZ(lonlat):
  theta  = np.radians(lonlat[:,0])+np.pi
  phi    = np.radians(lonlat[:,1])+np.pi/2
  x      = Rearth*np.cos(theta)*np.sin(phi)
  y      = Rearth*np.sin(theta)*np.sin(phi)
  z      = Rearth*np.cos(phi)
  return np.transpose(np.vstack((x,y,z)))

#Get all points which lie with `r_km` Great Circle kilometres of the query
#points `qpts`.
def GetNeighboursWithinR(qpts,kdtree,r_km):
  #We need to convert Great Circle kilometres into chord length kilometres in
  #order to use the kd-tree
  #See: http://mathworld.wolfram.com/CircularSegment.html
  angle        = r_km/Rearth
  chord_length = 2*Rearth*np.sin(angle/2)
  pts3d        = ConvertToXYZ(qpts)
  #See: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.query_ball_point.html#scipy.spatial.KDTree.query_ball_point
  #p=2 implies Euclidean distance, eps=0 implies no approximation (slower)
  return kdtree.query_ball_point(pts3d,chord_length,p=2,eps=0) 


##############################################################################
#WARNING! Do NOT alter pts3d or kdtree will malfunction and need to be rebuilt
##############################################################################

##############################
#Correctness tests on the North, South, East, and West poles, along with Kolkata
ptsll = np.array([[0,90],[0,-90],[0,0],[-180,0],[88.3639,22.5726]])
pts3d = ConvertToXYZ(ptsll)
kdtree = sp.spatial.KDTree(pts3d, leafsize=10) #Stick points in kd-tree for fast look-up

qptsll = np.array([[-3,88],[5,-85],[10,10],[-178,3],[175,4]])
GetNeighboursWithinR(qptsll, kdtree, 2000)

##############################
#Stress tests
ptsll = GenerateUniformSpherical(100000)    #Generate uniformly-distributed lon-lat points on a sphere
pts3d = ConvertToXYZ(ptsll)                 #Convert points to 3d
#See: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html
kdtree = sp.spatial.KDTree(pts3d, leafsize=10) #Stick points in kd-tree for fast look-up

qptsll = GenerateUniformSpherical(100)      #We'll find neighbours near these points
GetNeighboursWithinR(qptsll, kdtree, 500)
Richard
  • 56,349
  • 34
  • 180
  • 251
  • I'm doing it in a smaller scale actually, for example, Washington state. So I don't think singularities matters? – ZHU Jun 22 '17 at 02:33
  • Why can't RTree does the job? First filter out the intersecting boxes and then see if those possible matches do lie in the circle? – ZHU Jun 22 '17 at 02:35
  • I think you know what I'm asking. Can you give another implementation? Try utilize RTree-indexing in geopandas – ZHU Jun 22 '17 at 02:38
  • @ZHU: If you were just doing this on a small scale, you should have clearly stated that in your question. Please edit your question to do so now, and include anything else that might be relevant. – Richard Jun 22 '17 at 06:14
  • It's been edited. Anything else you would like to know? – ZHU Jun 22 '17 at 06:31
  • when is the last time you edited this answer? It looks different from when I last seen this – ZHU Jul 30 '17 at 08:10
  • @ZHU: Are you sure you're not mistaking it with the answer just above, which we put a lot more work into? I haven't edited this answer since the day I made it. – Richard Jul 30 '17 at 12:53
  • 1
    I've been fretting the problems of spatially indexing a sphere for a couple of weeks (and I've read several papers on working but complex system that would end up being heavier than all the rest of my code combined) when I suddenly recalled have passed over this suggestion last week. Elegant, simple, practical with off-the-shelf libraries, and best of all *no edge cases*. Thanks. – dmckee --- ex-moderator kitten Apr 15 '19 at 21:24
  • @dmckee: [This](https://github.com/r-barnes/dggridR) may also be useful to you, but it is not as turn-key. – Richard Apr 15 '19 at 22:15
4

I've attached code which should, with some minor modifications, do what you want.

I think your problem arose for one of two reasons:

  1. You were not correctly constructing the spatial index. Your responses to my comments suggested that you weren't wholly aware of how the spatial index was getting made.

  2. The bounding box for your spatial query was not built correctly.

I'll discuss both possibilities below.

Constructing the spatial index

As it turns out, the spatial index is constructed simply by typing:

sindex = gpd_df.sindex

Magic.

But from whence does gpd_df.sindex get its data? It assumes that the data is stored in a column called geometry in a shapely format. If you have not added data to such a column, it will raise a warning.

A correct initialization of the data frame would look like so:

#Generate random points throughout Oregon
x = np.random.uniform(low=oregon_xmin, high=oregon_xmax, size=10000)
y = np.random.uniform(low=oregon_ymin, high=oregon_ymax, size=10000)

#Turn the lat-long points into a geodataframe
gpd_df = gpd.GeoDataFrame(data={'x':x, 'y':y})
#Set up point geometries so that we can index the data frame
#Note that I am using x-y points!
gpd_df['geometry'] = gpd_df.apply(lambda row: shapely.geometry.Point((row['x'], row['y'])), axis=1)

#Automagically constructs a spatial index from the `geometry` column
gpd_df.sindex 

Seeing the foregoing sort of example code in your question would have been helpful in diagnosing your problem and getting going on solving it.

Since you did not get the extremely obvious warning geopandas raises when a geometry column is missing:

AttributeError: No geometry data set yet (expected in column 'geometry'.

I think you've probably done this part right.

Constructing the bounding box

In your question, you form a bounding box like so:

nearest_index = list(sindex.intersection((pt_center.LATITUDE-r, pt_center.LONGITUDE-r, pt_center.LATITUDE+r, pt_center.LONGITUDE+r)))

As it turns out, bounding boxes have the form:

(West, South, East, North)

At least, they do for X-Y styled-points, e.g. shapely.geometry.Point(Lon,Lat)

In my code, I use the following:

bbox = (cpt.x-radius, cpt.y-radius, cpt.x+radius, cpt.y+radius)

Working example

Putting the above together leads me to this working example. Note that I also demonstrate how to sort points by distance, answering your second question.

#!/usr/bin/env python3

import numpy as np
import numpy.random
import geopandas as gpd
import shapely.geometry
import operator

oregon_xmin = -124.5664
oregon_xmax = -116.4633
oregon_ymin = 41.9920
oregon_ymax = 46.2938

def radius(gpd_df, cpt, radius):
  """
  :param gpd_df: Geopandas dataframe in which to search for points
  :param cpt:    Point about which to search for neighbouring points
  :param radius: Radius about which to search for neighbours
  :return:       List of point indices around the central point, sorted by
                 distance in ascending order
  """
  #Spatial index
  sindex = gpd_df.sindex
  #Bounding box of rtree search (West, South, East, North)
  bbox = (cpt.x-radius, cpt.y-radius, cpt.x+radius, cpt.y+radius)
  #Potential neighbours
  good = []
  for n in sindex.intersection(bbox):
    dist = cpt.distance(gpd_df['geometry'][n])
    if dist<radius:
      good.append((dist,n))
  #Sort list in ascending order by `dist`, then `n`
  good.sort() 
  #Return only the neighbour indices, sorted by distance in ascending order
  return [x[1] for x in good]

#Generate random points throughout Oregon
x = np.random.uniform(low=oregon_xmin, high=oregon_xmax, size=10000)
y = np.random.uniform(low=oregon_ymin, high=oregon_ymax, size=10000)

#Turn the lat-long points into a geodataframe
gpd_df = gpd.GeoDataFrame(data={'x':x, 'y':y})
#Set up point geometries so that we can index the data frame
gpd_df['geometry'] = gpd_df.apply(lambda row: shapely.geometry.Point((row['x'], row['y'])), axis=1)

#The 'x' and 'y' columns are now stored as part of the geometry, so we remove
#their columns in order to save space
del gpd_df['x']
del gpd_df['y']

for i, row in gpd_df.iterrows():
  neighbours = radius(gpd_df,row['geometry'],0.5)
  print(neighbours)
  #Use len(neighbours) here to construct a new row for the data frame

(What I had been requesting in the comments is code that looks like the foregoing, but which exemplifies your problem. Note the use of random to succinctly generate a dataset for experimentation.)

Richard
  • 56,349
  • 34
  • 180
  • 251
  • @ZHU: That's not particularly descriptive. Does the example I provide work? If not, there may be a problem with your setup: it works on my end. If so, you need to figure out what you're doing differently. – Richard Jun 28 '17 at 19:00
  • See my edited question which includes your implementation, I got all counts 0. While using my original code, I'm counting all points overall 41231. – ZHU Jun 28 '17 at 19:01
  • @ZHU: My example breaks apart some of the functionality into separate functions. I don't think you lose speed by doing this, since so much of this is taking place in Python anyway. You may want to increase your level of abstraction (use more functions) so that you can verify each part separately. – Richard Jun 28 '17 at 19:03
  • No I mean when I count each neighbour: using my original code I got all 0, using your code I got all 41231, which is the total number of points to be considered as neighbours. – ZHU Jun 28 '17 at 19:04
  • @ZHU: And, again, you should post a _complete example_ in your question: no one can run the code you've posted without knowing, for instance, how the points are generated, and which libraries you are using. Edit my example bit by bit until it matches your needs and see where it breaks. – Richard Jun 28 '17 at 19:04
  • It's just not counting correctly, just as what my original code did. Not counting anything, intersection returns either the whole set of points or no points. – ZHU Jun 28 '17 at 19:05
  • Here you can download the datasets https://catalogue.data.gov.bc.ca/dataset/hellobc-activities-and-attractions-listing No extra libraries are used other than geopandas – ZHU Jun 28 '17 at 19:06
  • @ZHU: I'm not downloading the datasets. I expect you to post self-contained code using randomly-generated points, like I did, that reproduces the problem you are having. There's nothing else I can do until you are able to do that. – Richard Jun 28 '17 at 19:07
  • The problem I have is with this dataset only. I don't think it's sensible to generate random points. – ZHU Jun 28 '17 at 19:09
  • @ZHU: Have you tried it with random points covering the same geographic area as your dataset? – Richard Jun 28 '17 at 19:11
  • I'm not very professional in this field of QGIS. Can you at least try downloading the shape file? From what you describe there might be problem with the file? However so far as I checked the geolocation everything is fine. – ZHU Jun 28 '17 at 19:15
  • This is another set of points that can be used with the one above https://catalogue.data.gov.bc.ca/dataset/hellobc-festivals-and-events-listing – ZHU Jun 28 '17 at 19:17
  • 1
    @ZHU: The problem could be many places, including with your data. My example shows how to use randomly-generated points with the algorithm you propose. Please construct an example using randomly generated points that duplicates your problem. If you cannot do so, the problem may be with your data. If you can do so, then the problem is in your code and it will be easy to find because we will both be looking at the same thing. – Richard Jun 28 '17 at 19:24