
Is there a faster way (in Python, on a CPU) of doing the same thing as the function below? I've used for loops and if statements, and I'm wondering whether there is a faster approach. The function currently takes about 1 minute per 100 postcodes, and I have about 70,000 to get through.

The 2 dataframes used are:

postcode_df which contains 71,092 rows, and columns:

  • Postcode e.g. "BL4 7PD"
  • Latitude e.g. 53.577653
  • Longitude e.g. -2.434136

e.g.

postcode_df = pd.DataFrame({"Postcode":["SK12 2LH", "SK7 6LQ"],
                                    "Latitude":[53.362549, 53.373812],
                                    "Longitude":[-2.061329, -2.120956]})

air which contains 421 rows, and columns:

  • TubeRef e.g. "ABC01"
  • Latitude e.g. 53.55108
  • Longitude e.g. -2.396236

e.g.

air = pd.DataFrame({"TubeRef":["Stkprt35", "Stkprt07", "Stkprt33"],
                                    "Latitude":[53.365085, 53.379502, 53.407510],
                                    "Longitude":[-2.0763, -2.120777, -2.145632]})

The function loops through each postcode in postcode_df, and for each postcode loops through each TubeRef and calculates (using geopy) the distance between them and saves the TubeRef with the shortest distance to the postcode.

The output df, postcode_nearest_tube_refs, contains the nearest tube per postcode and contains columns:

  • Postcode e.g. "BL4 7PD"
  • Nearest Air Tube e.g. "ABC01"
  • Distance to Air Tube KM e.g. 1.035848

# define function to get nearest air quality monitoring tube per postcode
def get_nearest_tubes(constituency_list):
    
    postcodes = []
    nearest_tubes = []
    distances_to_tubes = []
    
    for postcode in postcode_df["Postcode"]:
            closest_tube = ""
            shortest_dist = 500

            postcode_lat = postcode_df.loc[postcode_df["Postcode"]==postcode, "Latitude"]
            postcode_long = postcode_df.loc[postcode_df["Postcode"]==postcode, "Longitude"]
            postcode_coord = (float(postcode_lat), float(postcode_long))


            for tuberef in air["TubeRef"]:
                tube_lat = air.loc[air["TubeRef"]==tuberef, "Latitude"]
                tube_long = air.loc[air["TubeRef"]==tuberef, "Longitude"]
                tube_coord = (float(tube_lat), float(tube_long))

                # calculate distance between postcode and tube
                dist_to_tube = geopy.distance.distance(postcode_coord, tube_coord).km
                if dist_to_tube < shortest_dist:
                    shortest_dist = dist_to_tube
                    closest_tube = str(tuberef)

            # save postcode's tuberef with shortest distance
            postcodes.append(str(postcode))
            nearest_tubes.append(str(closest_tube))
            distances_to_tubes.append(shortest_dist)
            
    # create dataframe of the postcodes, nearest tuberefs and distance
    postcode_nearest_tube_refs = pd.DataFrame({"Postcode":postcodes, 
                                          "Nearest Air Tube":nearest_tubes, 
                                          "Distance to Air Tube KM": distances_to_tubes})

    return postcode_nearest_tube_refs

Libraries I'm using are:

import numpy as np
import pandas as pd
# !pip install geopy
import geopy.distance
code_to_joy
  • updated your post with sample input and expected output – deadshot Aug 24 '20 at 09:02
  • Your question is still not clear; see https://stackoverflow.com/help/minimal-reproducible-example for how to make a minimal reproducible example. – Yohai Magan Aug 24 '20 at 09:09
  • It is called the nearest neighbor search or the post-office problem. There is a wikipedia page: https://en.wikipedia.org/wiki/Nearest_neighbor_search – Andreas Vinter-Hviid Aug 24 '20 at 09:23
  • 700 minutes is less than 12 hours, so it should be just fine. Just don't forget to save the result afterwards – Andrey Sobolev Aug 24 '20 at 09:27
  • The geopandas package provides spatial indexes. [Lesson 3 of the AutoGIS 2019 course](https://automating-gis-processes.github.io/site/notebooks/L3/spatial_index.html) covers both geocoding and usage of r-trees. – Masklinn Aug 24 '20 at 09:30
  • Don't calculate the full distance matrix; go for the BallTree algorithm: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html It supports the haversine distance and scales much better than a full distance matrix. My guess is that this takes seconds to minutes. Let me know if you need a full working example. Please provide a few lines of data in a pandas-friendly form. – Willem Hendriks Aug 24 '20 at 09:47
  • @user3184950 (https://stackoverflow.com/users/3184950/user3184950) thanks. I've updated the question with the Pandas code for creating the input dataframes with some sample rows. Does this give you what you need? It would be great to see a full working example. – code_to_joy Aug 24 '20 at 10:05
  • Yup, that helps. I posted it, and it turns out to take less than 10 seconds. – Willem Hendriks Aug 24 '20 at 10:20

2 Answers


A working example is below; it runs in under 10 seconds.

Import libraries

import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
import uuid

I generate some random data. This takes about a second as well, but at least the row counts are realistic.

np_rand_post = 5 * np.random.random((72000,2))
np_rand_post = np_rand_post + np.array((53.577653, -2.434136))

Use UUIDs as fake postcodes:

postcode_df = pd.DataFrame( np_rand_post , columns=['lat', 'long'])
postcode_df['postcode'] = [uuid.uuid4().hex[:6] for _ in range(72000)]
postcode_df.head()

We do the same for the air quality tubes:

np_rand = 5 * np.random.random((500,2))
np_rand = np_rand + np.array((53.55108, -2.396236))

and again use UUIDs for the fake tube refs:

tube_df = pd.DataFrame( np_rand , columns=['lat', 'long'])
tube_df['ref'] = [uuid.uuid4().hex[:5] for _ in range(500)]
tube_df.head()

Extract the GPS values as numpy arrays:

postcode_gps = postcode_df[["lat", "long"]].values
air_gps = tube_df[["lat", "long"]].values

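Convert the coordinates to radians (the haversine metric expects [lat, long] in radians) and create a BallTree on the tubes: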

postal_radians =  np.radians(postcode_gps)
air_radians = np.radians(air_gps)

tree = BallTree(air_radians, leaf_size=15, metric='haversine')

Query the tree for the single closest tube (k=1) to each postcode:

distance, index = tree.query(postal_radians, k=1)

Note that the returned distance is in radians, not km; multiply it by the Earth's radius (in metres here) to convert.

earth_radius = 6371000
distance_in_meters = distance * earth_radius
distance_in_meters
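
As a sanity check on the unit conversion, the sketch below compares the BallTree result for one postcode/tube pair from the question's sample data with a hand-written haversine formula. The helper name haversine_km and the constant EARTH_RADIUS_KM (a mean radius of 6371 km) are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One postcode and one tube from the question's sample data, as [lat, long]
postcode = np.array([[53.362549, -2.061329]])  # SK12 2LH
tube = np.array([[53.365085, -2.0763]])        # Stkprt35

# BallTree with the haversine metric works on radians and returns radians
tree = BallTree(np.radians(tube), metric="haversine")
dist_rad, _ = tree.query(np.radians(postcode), k=1)

tree_km = dist_rad[0, 0] * EARTH_RADIUS_KM
direct_km = haversine_km(postcode[0, 0], postcode[0, 1], tube[0, 0], tube[0, 1])
print(tree_km, direct_km)  # the two values should agree closely
```

Both values assume a spherical Earth, so they can differ slightly from geopy.distance.distance, which by default computes a geodesic on an ellipsoid.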

And, for instance, get the ref of each postcode's nearest tube with tube_df.ref[ index[:,0] ]
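
To get from the query result to the output table described in the question, the pieces can be assembled as in the sketch below. It uses the question's sample dataframes and column names rather than the random data above; EARTH_RADIUS_KM is an assumed constant, and the haversine distance will differ slightly from the geopy value because it treats the Earth as a sphere.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km

# Sample input frames copied from the question
postcode_df = pd.DataFrame({"Postcode": ["SK12 2LH", "SK7 6LQ"],
                            "Latitude": [53.362549, 53.373812],
                            "Longitude": [-2.061329, -2.120956]})
air = pd.DataFrame({"TubeRef": ["Stkprt35", "Stkprt07", "Stkprt33"],
                    "Latitude": [53.365085, 53.379502, 53.407510],
                    "Longitude": [-2.0763, -2.120777, -2.145632]})

# Build the tree on the tubes, then query it with every postcode at once
air_rad = np.radians(air[["Latitude", "Longitude"]].to_numpy())
post_rad = np.radians(postcode_df[["Latitude", "Longitude"]].to_numpy())
tree = BallTree(air_rad, metric="haversine")
distance, index = tree.query(post_rad, k=1)  # both have shape (n_postcodes, 1)

# index holds row positions in `air`; distance is in radians
postcode_nearest_tube_refs = pd.DataFrame({
    "Postcode": postcode_df["Postcode"].to_numpy(),
    "Nearest Air Tube": air["TubeRef"].to_numpy()[index[:, 0]],
    "Distance to Air Tube KM": distance[:, 0] * EARTH_RADIUS_KM,
})
print(postcode_nearest_tube_refs)
```

Indexing the TubeRef values by position (via to_numpy) avoids relying on the dataframe having a default 0..n-1 index.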

Willem Hendriks

You can use numpy to compute the distance matrix between every point in set A and every point in set B, then, for each point in set B, take the point in set A with the minimal distance.

import numpy as np
import pandas as pd

dfA = pd.DataFrame({'lat':np.random.uniform(0, 30, 3), 'lon':np.random.uniform(0, 30, 3), 'id':[1,2,3]})
dfB = pd.DataFrame({'lat':np.random.uniform(0, 30, 3), 'lon':np.random.uniform(0, 30, 3), 'id':['a', 'b', 'c']})
lat1 = dfA.lat.values.reshape(-1, 1)
lat2 = dfB.lat.values.reshape(1, -1)
lon1 = dfA.lon.values.reshape(-1, 1)
lon2 = dfB.lon.values.reshape(1, -1)
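# Euclidean distance in degrees: only an approximation of true ground distance
dists = np.sqrt((lat1 - lat2)**2 + (lon1 - lon2)**2)  # shape (len(dfA), len(dfB))
# axis=0 picks, for each point in dfB, the row of dfA with the smallest distance
for id1, id2 in zip(dfB.id, dfA.id.iloc[np.argmin(dists, axis=0)]):
    print(f'the closest point in dfA to {id1} is {id2}')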
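
A variant of the same broadcasting idea, sketched below, swaps the Euclidean distance for the haversine formula so the result is in km, and uses frames of different lengths. The data is the question's sample (tubes as dfA, postcodes as dfB); EARTH_RADIUS_KM and the result column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km

# dfA: candidate points (tubes); dfB: query points (postcodes); lengths differ
dfA = pd.DataFrame({"id": ["Stkprt35", "Stkprt07", "Stkprt33"],
                    "lat": [53.365085, 53.379502, 53.407510],
                    "lon": [-2.0763, -2.120777, -2.145632]})
dfB = pd.DataFrame({"id": ["SK12 2LH", "SK7 6LQ"],
                    "lat": [53.362549, 53.373812],
                    "lon": [-2.061329, -2.120956]})

# Column vectors for A and row vectors for B broadcast to a (len(A), len(B)) grid
lat_a = np.radians(dfA["lat"].to_numpy())[:, None]
lon_a = np.radians(dfA["lon"].to_numpy())[:, None]
lat_b = np.radians(dfB["lat"].to_numpy())[None, :]
lon_b = np.radians(dfB["lon"].to_numpy())[None, :]

# Haversine formula evaluated for every (A, B) pair at once
a = (np.sin((lat_b - lat_a) / 2) ** 2
     + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
dists_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# For each column (point in B), the row position of the closest point in A
nearest_in_a = np.argmin(dists_km, axis=0)
result = pd.DataFrame({"id_B": dfB["id"].to_numpy(),
                       "nearest_id_A": dfA["id"].to_numpy()[nearest_in_a],
                       "dist_km": dists_km.min(axis=0)})
print(result)
```

For the sizes in the question (71,092 postcodes by 421 tubes) the full matrix has about 30 million entries, so memory use is noticeable; processing the postcodes in chunks is one way to limit it.
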
Yohai Magan
  • Does this only work if dfA and dfB are the same length? I get an error "IndexError: positional indexers are out-of-bounds" trying this with my dfA that contains more rows than dfB – code_to_joy Aug 24 '20 at 09:54
  • it does not depend on the length of A and B, you can try my solution with a different length data frame. – Yohai Magan Aug 24 '20 at 10:17