
I have a set of 2000 geospatial points (lon/lat), which I need to match with several other geospatial datasets (I am using Geopandas GeoDataFrames). I am using the sklearn BallTree function to find the neighbors within a certain radius of each point (in the function below, point is one of the 2000 points and right_gdf is the dataset that I need to get the neighbors from).

I am currently using a for-loop to loop through all of the 2000 points and find the neighbors for each of them. However, depending on the size of right_gdf, this can take a long time. I am sure there is a way to speed this process up, potentially with parallel computing, but I am struggling to find it. I tried to use Dask delayed to parallelise the loop (see code below), but somehow this takes even longer than the simple for-loop.

# Function that finds a point's neighbors within a certain radius
def neighbours_radius(point, right_gdf, R=1):

  # Create tree from the right gdf (use haversine for lat/lon coordinates)
  tree = BallTree(right_gdf, leaf_size=40, metric='haversine')

  # Find indices of all neighbors within R
  indices = tree.query_radius(point, r=R)[0]

  return indices


# Function that loops through the 2000 points
def knn_gpd(right_gdf, R=75):

  # Load the gdf with the 2000 points
  base = gpd.read_file(...)

  # Empty list to fill in the indices of the neighbors
  neighbors = []

  # Loop through the points and find the neighbors within R.
  for i in range(len(base)):
      point = base.iloc[i:i+1,:]
      ind = neighbours_radius(point, right_gdf, R=R)

      # append index lists
      neighbors.append(ind)

  return neighbors


# Function that loops through the 2000 points with Dask delayed
def knn_gpd_dask(right_gdf, R=75):

  # Load the gdf with the 2000 points
  base = gpd.read_file(...)

  # Empty list to fill in the indices of the neighbors
  neighbors = []

  # Loop through the points and find the neighbors within R.
  for i in range(len(base)):
    point = base.iloc[i:i+1,:]
    ind = delayed(neighbours_radius)(point, right_gdf, R=R)

    # append index list
    neighbors.append(ind)
  
  result = compute(neighbors)

  return result

Can anyone help me speed up this process?


1 Answer


If you profile your code, I suspect you will find that creating the BallTree is taking up most of the time, because you are creating it 2000 times. You should try to create it only once, like this for example:

# Function that loops through the 2000 points
def knn_gpd(right_gdf, R=75):

  # Load the gdf with the 2000 points
  base = gpd.read_file(...)

  # Create tree from the right gdf (use haversine for lat/lon coordinates)
  tree = BallTree(right_gdf, leaf_size=40, metric='haversine')

  # Empty list to fill in the indices of the neighbors
  neighbors = []

  # Loop through the points and find the neighbors within R.
  for i in range(len(base)):
      point = base.iloc[i:i+1,:]

      # Find indices of all neighbors within R
      indices = tree.query_radius(point, r=R)[0]

      # append index list
      neighbors.append(indices)

  return neighbors
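As a further speed-up, query_radius accepts a whole array of query points, so the Python loop can be replaced by a single vectorized call. A minimal sketch with made-up random coordinates (your GeoDataFrames would supply the real ones): note that sklearn's haversine metric expects (lat, lon) pairs in radians, so both the coordinates and the search radius need converting. Here R=75 is assumed to be kilometres, divided by an approximate Earth radius of 6371 km.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical coordinates in degrees; in practice, extract these from
# your GeoDataFrames, e.g. np.c_[gdf.geometry.y, gdf.geometry.x]
rng = np.random.default_rng(0)
right_lonlat = rng.uniform([-5, 50], [5, 55], size=(10000, 2))  # (lon, lat)
base_lonlat = rng.uniform([-5, 50], [5, 55], size=(2000, 2))

# Haversine wants (lat, lon) in radians: swap columns and convert
right_coords = np.radians(right_lonlat[:, ::-1])
base_coords = np.radians(base_lonlat[:, ::-1])

# Build the tree once
tree = BallTree(right_coords, leaf_size=40, metric='haversine')

# Convert a 75 km search radius to radians (Earth radius ~6371 km)
R_km = 75
r = R_km / 6371.0

# One vectorized call: returns an array of index arrays, one per query point
neighbors = tree.query_radius(base_coords, r=r)
```

This removes the per-point Python overhead entirely, and each element of `neighbors` is the array of right_gdf indices within R of the corresponding base point.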