Is there a faster way (in Python, on a CPU) to do the same thing as the function below? I've used for loops and if statements, and I'm wondering whether there is a faster approach. The function currently takes about 1 minute per 100 postcodes, and I have about 70,000 to get through.
The two dataframes used are:
postcode_df, which contains 71,092 rows and the columns:
- Postcode e.g. "BL4 7PD"
- Latitude e.g. 53.577653
- Longitude e.g. -2.434136
e.g.
postcode_df = pd.DataFrame({"Postcode":["SK12 2LH", "SK7 6LQ"],
                            "Latitude":[53.362549, 53.373812],
                            "Longitude":[-2.061329, -2.120956]})
air, which contains 421 rows and the columns:
- TubeRef e.g. "ABC01"
- Latitude e.g. 53.55108
- Longitude e.g. -2.396236
e.g.
air = pd.DataFrame({"TubeRef":["Stkprt35", "Stkprt07", "Stkprt33"],
                    "Latitude":[53.365085, 53.379502, 53.407510],
                    "Longitude":[-2.0763, -2.120777, -2.145632]})
The function loops through each postcode in postcode_df and, for each postcode, loops through each TubeRef, calculates the distance between them (using geopy), and saves the TubeRef with the shortest distance to the postcode.
The output dataframe, postcode_nearest_tube_refs, contains the nearest tube per postcode and has the columns:
- Postcode e.g. "BL4 7PD"
- Nearest Air Tube e.g. "ABC01"
- Distance to Air Tube KM e.g. 1.035848
# define function to get nearest air quality monitoring tube per postcode
def get_nearest_tubes(constituency_list):
    postcodes = []
    nearest_tubes = []
    distances_to_tubes = []
    for postcode in postcode_df["Postcode"]:
        closest_tube = ""
        shortest_dist = 500
        postcode_lat = postcode_df.loc[postcode_df["Postcode"]==postcode, "Latitude"]
        postcode_long = postcode_df.loc[postcode_df["Postcode"]==postcode, "Longitude"]
        postcode_coord = (float(postcode_lat), float(postcode_long))
        for tuberef in air["TubeRef"]:
            tube_lat = air.loc[air["TubeRef"]==tuberef, "Latitude"]
            tube_long = air.loc[air["TubeRef"]==tuberef, "Longitude"]
            tube_coord = (float(tube_lat), float(tube_long))
            # calculate distance between postcode and tube
            dist_to_tube = geopy.distance.distance(postcode_coord, tube_coord).km
            if dist_to_tube < shortest_dist:
                shortest_dist = dist_to_tube
                closest_tube = str(tuberef)
        # save postcode's tuberef with shortest distance
        postcodes.append(str(postcode))
        nearest_tubes.append(str(closest_tube))
        distances_to_tubes.append(shortest_dist)
    # create dataframe of the postcodes, nearest tuberefs and distance
    postcode_nearest_tube_refs = pd.DataFrame({"Postcode":postcodes,
                                               "Nearest Air Tube":nearest_tubes,
                                               "Distance to Air Tube KM": distances_to_tubes})
    return postcode_nearest_tube_refs
Libraries I'm using are:
import numpy as np
import pandas as pd
# !pip install geopy
import geopy.distance
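For comparison, the nested loop described above could be sketched as a single broadcast NumPy computation. This is only a sketch under stated assumptions, not a drop-in replacement: it swaps geopy's geodesic (ellipsoidal) distance for the haversine great-circle formula on a sphere, so the distances will differ slightly (typically by a fraction of a percent), which could change the result when two tubes are almost equally far away. The function name nearest_tubes_haversine and the constant EARTH_RADIUS_KM are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: mean Earth radius for a spherical model (geopy uses an ellipsoid instead)
EARTH_RADIUS_KM = 6371.0088


def nearest_tubes_haversine(postcode_df, air):
    """Return the nearest tube per postcode using haversine distances (sketch)."""
    # Column vectors for postcodes, row vectors for tubes, so that
    # broadcasting produces an (n_postcodes, n_tubes) matrix of all pairs.
    p_lat = np.radians(postcode_df["Latitude"].to_numpy())[:, None]
    p_lon = np.radians(postcode_df["Longitude"].to_numpy())[:, None]
    t_lat = np.radians(air["Latitude"].to_numpy())[None, :]
    t_lon = np.radians(air["Longitude"].to_numpy())[None, :]

    # Haversine formula for great-circle distance
    a = (np.sin((t_lat - p_lat) / 2) ** 2
         + np.cos(p_lat) * np.cos(t_lat) * np.sin((t_lon - p_lon) / 2) ** 2)
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

    # Index of the closest tube for every postcode
    nearest = dist_km.argmin(axis=1)

    return pd.DataFrame({
        "Postcode": postcode_df["Postcode"].to_numpy(),
        "Nearest Air Tube": air["TubeRef"].to_numpy()[nearest],
        "Distance to Air Tube KM": dist_km[np.arange(len(nearest)), nearest],
    })
```

With 71,092 postcodes and 421 tubes, each intermediate array holds about 30 million values (roughly 240 MB as float64), so processing the postcodes in chunks may be needed if memory is tight. If the exact geodesic distance matters, one option would be to recompute only the winning pair per postcode with geopy afterwards.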